natmin322 committed on Mar 26

Commit

bd400be

1 Parent(s): a555ead

new v2


Files changed (25)

improve_gainlora/IDEA_Overall.md +351 -402
improve_gainlora/IDEA_Overall.md.bak +337 -0
improve_gainlora/RUN_GUIDE_DIAGNOSTIC.md +208 -0
improve_gainlora/T5_small/gen_script_long_order3_t5_small_specroute.sh +60 -0
improve_gainlora/T5_small/gen_script_long_order3_t5_small_specroute_v10a.sh +60 -0
improve_gainlora/T5_small/gen_script_long_order3_t5_small_specroute_v10b.sh +60 -0
improve_gainlora/T5_small/gen_script_long_order3_t5_small_specroute_v11.sh +60 -0
improve_gainlora/T5_small/gen_script_long_order4_t5_small_specroute.sh +60 -0
improve_gainlora/T5_small/gen_script_superni_order1_t5_small_specroute.sh +60 -0
improve_gainlora/T5_small/gen_script_superni_order2_t5_small_specroute.sh +60 -0
improve_gainlora/_patch_cpi_oap.py +72 -0
improve_gainlora/analyze_diagnostics.py +205 -0
improve_gainlora/gen_script_long_order3_t5_specroute.sh +60 -0
improve_gainlora/gen_script_long_order4_t5_specroute.sh +60 -0
improve_gainlora/gen_script_superni_order1_llama_specroute.sh +60 -0
improve_gainlora/gen_script_superni_order1_llama_specroute_p100.sh +60 -0
improve_gainlora/gen_script_superni_order1_t5_specroute.sh +60 -0
improve_gainlora/gen_script_superni_order2_llama_specroute.sh +60 -0
improve_gainlora/gen_script_superni_order2_llama_specroute_p100.sh +60 -0
improve_gainlora/gen_script_superni_order2_t5_specroute.sh +60 -0
improve_gainlora/improve_gainlora.tex +231 -0
improve_gainlora/src/cl_trainer_specroute.py +239 -33
improve_gainlora/src/run_t5.py +47 -1
improve_gainlora/src/t5_specroute.py +7 -0
root_gainlora/root_gainlora.tex +199 -0

improve_gainlora/IDEA_Overall.md CHANGED

@@ -1,603 +1,552 @@
-# SpecRoute: Data-Driven Spectral Routing for Continual Learning with LoRA
-> **Official design document, V8**
-> Constraints: strict zero-replay. Theory-first. Consolidated after V2–V7.
 ---
-## 1. Problem Statement
-**Setting.** Continual learning with expandable LoRA on a frozen LLM.
-Tasks $\mathcal{T}_1, \ldots, \mathcal{T}_T$ arrive sequentially. For each task $t$:
-- A low-rank adapter $\Delta W_t = B_t A_t$ ($A_t \in \mathbb{R}^{r \times d}$, $B_t \in \mathbb{R}^{d \times r}$) is added to every attention projection.
-- Only $B_t$ is trained; $A_t$ is frozen after its null-space initialization (InfLoRA).
-- After training, both $A_t$ and $B_t$ are frozen and a new branch is created for the next task.
-**Inference.** Given an input $x$ *without* a task identifier, the model must produce the correct output:
-$$y = f\!\Bigl(W_0\, x \;+\; \sum_{t=1}^{T} w_t(x)\; B_t A_t\, x\Bigr)$$
-**Three coupled subproblems:**
-| Subproblem | Goal | Formal requirement |
-|:------------:|---------|-------------------|
-| **Routing (R)** | Assign each input to the correct expert | $w_{t^*}(x) \gg w_t(x)$ for $t \neq t^*$ |
-| **Protection (P)** | Prevent degradation of old experts | $\Delta W_t$ unchanged after task $t$ |
-| **Allocation (A)** | Manage the finite subspace capacity | $\sum_t \dim\bigl(\mathrm{span}(A_t)\bigr) \leq d$ |
-**Setting constraint:** *Zero-replay*: no data from earlier tasks is reused in any form (raw, synthetic, or as distribution statistics).
 ---
-## 2. Observation: A Hidden Duality
-### 2.1 GainLoRA's Approach and Its Weaknesses
-GainLoRA (NeurIPS 2025) treats R, P, and A as **independent** problems:
-| Aspect | Mechanism | Cost |
-|-----------|--------|---------|
-| R | Learned MLP `trans_input` + learned `prompt_key` → cosine gating | Extra parameters + GPM subspace |
-| P | GPM projects gradients onto the null-space of old tasks | Consumes subspace for every task |
-| A | Increasing threshold $\varepsilon_t \nearrow 1$ | Later tasks are more constrained |
-**Fundamental weakness:** Because routing is *learned*, it creates a vicious cycle:
-1. `trans_input` changes with every task → the routing space drifts → old prompt keys lose alignment → routing degrades.
-2. GPM must protect the routing parameters → *it consumes subspace that could serve task learning*.
-3. KL distillation on routing becomes necessary → requires replay or frozen copies → memory overhead.
-### 2.2 Key Insight
-We observe that GPM approximately guarantees that the experts' input subspaces are mutually orthogonal:
-$$\mathrm{span}(V_i) \;\approx\perp\; \mathrm{span}(V_j), \qquad i \neq j$$
-where $V_t$ denotes the right singular vectors of $\Delta W_t$. This orthogonality, enforced for the purpose of **protection**, also provides a natural **routing** criterion: because the subspaces do not overlap, measuring how well an input aligns with each subspace uniquely identifies its source task.
-> **Routing–Protection Duality.**
-> Preventing forgetting (protecting orthogonal subspaces) and identifying the task (discriminative routing)
-> are *two dual manifestations of the same spectral structure*.
-> Solving one problem automatically solves the other.
-**Consequences:**
-- No learned routing parameters → no routing drift, no GPM budget spent on routing.
-- No replay needed to maintain routing → zero-replay compliance follows naturally.
-- Routing accuracy is *guaranteed* by protection quality (formalized below).
-### 2.3 GPM–Routing Paradox: The Duality Collapses with Random A
-**This is the core finding that explains why V2–V6 failed (AP ≈ 27–40).**
-Routing measures $\alpha_t(h) \propto \|A_t h\|^2$. To distinguish task $t^*$ from $s$, $A_{t^*}$ must align with inputs $h \sim p_{t^*}$.
-**Paradox:** InfLoRA initializes $A_t$ **randomly** (Kaiming) and then projects it onto the null-space. As a result:
-1. $\text{rowspace}(B_tA_t) \subseteq \text{rowspace}(A_t)$: left multiplication cannot enlarge the row space.
-2. $\text{rowspace}(A_t)$ spans $r$ **random** directions in the $d'$-dimensional null-space (it encodes no task identity).
-3. For same-domain tasks (yelp/amazon/sst2/imdb): the inputs $h$ vary along SHARED directions → the signatures cannot be told apart.
-**Empirical:** V6 IMDB (task 8): EM = 0.0 across all 10 epochs even though the training loss decreases, because inference routing assigns the wrong expert. V3 RTE and MNLI behave similarly. The root cause is not null-space exhaustion but the **absence of a routing signal in a random $A_t$**.
-### 2.4 The Way Out: Differential Projection via Data-Informed Init
-**Lemma (proved in §3.5):** Under the InfLoRA constraint $A_t P_{t-1} = 0$:
-$$\|A_t h\|^2 \;=\; \|A_t Q_{t-1} h\|^2 \quad \forall h, \quad Q_{t-1} = I - P_{t-1}$$
-Routing $\alpha_t$ **sees only the component $Q_{t-1}h$**, the part of $h$ OUTSIDE the span of all old tasks. For $h \sim p_s$ ($s < t$): GPM has captured $\geq 99.5\%$ of the variance of $p_s$ → $\|Q_{t-1}h\|^2 \leq 0.005\,\mathrm{tr}(C_s)$ → $\alpha_t(h) \approx 0$ naturally. **This is differential projection: task-discriminative by design.**
-**Remaining problem:** A random $A_t$ does not align with the highest-variance directions of task $t$ inside the null-space → $\|A_t Q_{t-1}h\|^2$ is small even when $\|Q_{t-1}h\|^2$ is substantial.
-**Solution (C5):** Initialize $A_t$ as the top-$r$ eigenvectors of $\tilde{C}_t = Q_{t-1}C_tQ_{t-1}$. Then:
-$$E_{h \sim p_t}[\|A_t h\|^2] = \sum_{i=1}^r \lambda_i(\tilde{C}_t) \quad \text{(the maximum value)}$$
-**C5 turns a random routing key into an optimal routing key, which resolves the GPM–Routing Paradox.**
-> **C5 ↔ C2 connection:** C5 not only helps learning (it maximizes the variance captured in the null-space) but **also** maximizes the routing signal. One initialization, two objectives. The contributions C1/C2 and C5 are inseparable.
 ---
-## 3. Theoretical Framework
-### 3.1 Spectral Expert Signatures
-**Definition 1** *(Spectral Signature).* For a frozen expert $\Delta W_t = B_t A_t$ with thin SVD
-$$\Delta W_t = U_t\, \Sigma_t\, V_t^\top, \qquad V_t \in \mathbb{R}^{d \times r},\; \Sigma_t = \mathrm{diag}(\sigma_{t,1}, \ldots, \sigma_{t,r}),$$
-the spectral signature is $\mathcal{S}_t = (V_t,\, \boldsymbol{\sigma}_t)$, where:
-- $V_t$: the **input receptive field**, i.e. the $r$ input directions the expert processes,
-- $\boldsymbol{\sigma}_t$: the **sensitivity spectrum**, i.e. the amplification factor along each direction.
-**Information-theoretic view.** Viewing $\Delta W_t$ as a linear channel, the columns of $V_t$ are the channel's *input modes* and $\sigma_{t,i}^2$ is the *gain* of mode $i$. The total channel capacity (Frobenius energy) is $\|\Delta W_t\|_F^2 = \sum_i \sigma_{t,i}^2$.
-### 3.2 Spectral Affinity
-**Definition 2** *(Spectral Affinity).* The affinity of an input $h \in \mathbb{R}^d$ with expert $t$:
-$$\alpha_t(h) \;=\; \frac{h^\top M_t\, h}{\mathrm{tr}(M_t)\;\|h\|^2}, \qquad M_t = V_t\, \mathrm{diag}(\boldsymbol{\sigma}_t^2)\, V_t^\top$$
-Expanded:
-$$\alpha_t(h) = \frac{\displaystyle\sum_{i=1}^{r} \sigma_{t,i}^2\;(v_{t,i}^\top h)^2}{\displaystyle\Bigl(\sum_{i=1}^{r} \sigma_{t,i}^2\Bigr)\,\|h\|^2}$$
-**Properties:**
-| Property | Statement |
-|-----------|-----------|
-| Range | $\alpha_t(h) \in [0,\, 1]$, a normalized weighted Rayleigh quotient |
-| Energy ratio | $\alpha_t(h) = \|\Delta W_t\, h\|^2 \;/\; \bigl(\|\Delta W_t\|_F^2\, \|h\|^2\bigr)$ |
-| Meaning | Fraction of expert $t$'s channel capacity activated by $h$ |
-| In-distribution | $h \in \mathrm{span}(V_t) \;\Rightarrow\; \alpha_t(h) \geq \kappa_{\min}(t) > 0$ |
-| Out-of-distribution | $h \perp \mathrm{span}(V_t) \;\Rightarrow\; \alpha_t(h) = 0$ exactly |
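The properties in the table above can be checked numerically. The following is a minimal sketch, assuming NumPy and a randomly generated rank-$r$ expert (the dimensions, seed, and function name `spectral_affinity` are illustrative and not taken from the repository). It evaluates Definition 2 directly and compares it against the energy-ratio form.

```python
import numpy as np

def spectral_affinity(delta_w: np.ndarray, h: np.ndarray, r: int) -> float:
    """alpha_t(h) = h^T M_t h / (tr(M_t) * ||h||^2), M_t = V diag(sigma^2) V^T."""
    _, sigma, vt = np.linalg.svd(delta_w, full_matrices=False)
    sigma, vt = sigma[:r], vt[:r]          # keep the rank-r part of the thin SVD
    m_t = vt.T @ np.diag(sigma**2) @ vt
    return float(h @ m_t @ h / (np.trace(m_t) * (h @ h)))

rng = np.random.default_rng(0)
d, r = 16, 3                                # illustrative sizes (assumption)
B = rng.normal(size=(d, r))
A = rng.normal(size=(r, d))
delta_w = B @ A

h = rng.normal(size=d)
alpha = spectral_affinity(delta_w, h, r)

# Energy-ratio form from the table: ||dW h||^2 / (||dW||_F^2 ||h||^2)
energy_ratio = np.linalg.norm(delta_w @ h) ** 2 / (
    np.linalg.norm(delta_w, "fro") ** 2 * (h @ h)
)

# Out-of-distribution input: remove the component of h inside rowspace(A)
row_basis, _ = np.linalg.qr(A.T)            # d x r orthonormal basis of rowspace(A)
h_perp = h - row_basis @ (row_basis.T @ h)
alpha_perp = spectral_affinity(delta_w, h_perp, r)
```

Under these assumptions the two forms agree up to floating-point error, and the affinity of `h_perp` is numerically zero.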
-### 3.3 Routing–Protection Duality Theorem
-**Definition 3** *(Subspace Overlap).* The overlap between experts $i$ and $j$:
-$$\delta_{ij} = \|V_i^\top V_j\|_F^2 = \sum_{k=1}^{r} \cos^2 \theta_{ij}^{(k)}$$
-where $\theta_{ij}^{(k)}$ are the *principal angles* between $\mathrm{span}(V_i)$ and $\mathrm{span}(V_j)$.
----
-**Theorem 1** *(Routing–Protection Duality).* If GPM guarantees $\delta_{ij} \leq \varepsilon$ for all $i \neq j$, then for every unit input $h \in \mathrm{span}(V_{t^*})$ the **routing margin** satisfies:
-$$\boxed{\;\alpha_{t^*}(h) \;-\; \max_{t \neq t^*}\, \alpha_t(h) \;\;\geq\;\; \kappa_{\min}(t^*)\; -\; \varepsilon\, \kappa_{\max}\;}$$
-where:
-$$\kappa_{\min}(t) = \frac{\sigma_{t,\min}^2}{\sum_i \sigma_{t,i}^2}, \qquad \kappa_{\max} = \max_t\, \frac{\sigma_{t,\max}^2}{\sum_i \sigma_{t,i}^2}$$
-**Proof.**
-*Lower bound for the correct expert.* Write $h = V_{t^*}\, c$ with $\|c\| = 1$. Then $(v_{t^*,i}^\top h)^2 = c_i^2$ and $\sum c_i^2 = 1$:
-$$\alpha_{t^*}(h) = \frac{\sum_i \sigma_{t^*,i}^2\, c_i^2}{\sum_i \sigma_{t^*,i}^2} \;\geq\; \kappa_{\min}(t^*)$$
-*Upper bound for a wrong expert.* For $t \neq t^*$:
-$$\|V_t^\top h\|^2 \leq \delta_{t,t^*} \leq \varepsilon \;\;\Rightarrow\;\; \alpha_t(h) \leq \kappa_{\max}\, \varepsilon \qquad\square$$
----
-**Corollary 1** *(Routing Confidence).* With softmax routing at temperature $\tau$:
-$$w_{t^*}(h) \;\geq\; \frac{1}{1 + (T{-}1)\,\exp\!\bigl(-m/\tau\bigr)}, \qquad m = \kappa_{\min}(t^*) - \varepsilon\, \kappa_{\max}$$
-To reach a target confidence $w_{t^*} \geq 1 - \delta$, set $\tau \leq m \,/\, \ln\!\bigl(\tfrac{T-1}{\delta}\bigr)$.
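As a worked instance of Corollary 1, the sketch below plugs in illustrative numbers (the values of $\kappa_{\min}$, $\kappa_{\max}$, $T$ and $\delta$ are assumptions chosen for the example, not measurements) and checks that the suggested temperature meets the target confidence.

```python
import math

def confidence_lower_bound(margin: float, tau: float, num_tasks: int) -> float:
    """Lower bound on w_{t*}(h) from Corollary 1."""
    return 1.0 / (1.0 + (num_tasks - 1) * math.exp(-margin / tau))

def max_temperature(margin: float, num_tasks: int, delta: float) -> float:
    """Temperature tau = m / ln((T-1)/delta) suggested by Corollary 1."""
    return margin / math.log((num_tasks - 1) / delta)

# Illustrative values (assumptions for this example only)
kappa_min, kappa_max, eps = 0.10, 0.60, 0.005
T, delta = 15, 0.05

m = kappa_min - eps * kappa_max          # routing margin from Theorem 1
tau = max_temperature(m, T, delta)
bound = confidence_lower_bound(m, tau, T)
```

At this temperature the bound evaluates to $1/(1+\delta)$, which is slightly above $1-\delta$, so the stated condition on $\tau$ is sufficient rather than tight.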
----
-**Corollary 2** *(Capacity Bound, Grassmannian connection).* Let $k_t$ be the **GPM effective rank** of task $t$: the number of eigenvectors GPM actually retains (99.5% threshold). The maximum number of tasks before the null-space collapses:
-$$T_{\max} \;\leq\; \frac{d}{\bar{k}\,(1 - \varepsilon)}, \qquad \bar{k} = \frac{1}{T}\sum_{t=1}^T k_t$$
-**Important note:** In practice $\bar{k} \gg r$. For T5-small ($d=512$) and rich NLP tasks, GPM keeps $k_t \approx 30\text{–}80$ dims per task to reach 99.5% variance (not $r=8$). A realistic estimate: 15 tasks × 50 dims = 750 $> d$ → **null-space saturation is a real risk**, not a theoretical one. This is why §3.10 discusses null-space collapse separately. The Grassmannian bound still holds for *the number of $r$-dimensional subspaces that can be packed*, but GPM's capacity constraint is $\sum_t k_t \leq d$, which is tighter than $\sum_t r \leq d$.
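The arithmetic behind this note can be reproduced in a few lines. The sketch below uses the figures quoted in the text ($d=512$, $r=8$, 15 tasks, roughly 50 retained dimensions per task); the helper name `t_max_bound` is hypothetical.

```python
def t_max_bound(d: int, k_bar: float, eps: float) -> float:
    """Upper bound on the number of tasks from Corollary 2."""
    return d / (k_bar * (1.0 - eps))

d, r, num_tasks = 512, 8, 15
k_bar = 50                      # mid-range of the 30-80 dims/task estimate
eps = 0.005

dims_by_gpm_rank = num_tasks * k_bar    # what GPM actually protects
dims_by_lora_rank = num_tasks * r       # what a rank-only count would suggest
bound = t_max_bound(d, k_bar, eps)

print(f"GPM-protected dims: {dims_by_gpm_rank} (d = {d})")
print(f"rank-only estimate: {dims_by_lora_rank}")
print(f"T_max bound: {bound:.2f}")
```

With these numbers the rank-only count stays well below $d$, while the GPM effective-rank count exceeds it, which is the gap the note is pointing at.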
-### 3.4 Orthogonality Guarantee from the InfLoRA Architecture
-> **This section closes the key theoretical gap.** Reviewers often worry: "GPM gradient projection only projects gradients; it does not guarantee that the $\Delta W_t$ have orthogonal subspaces." This observation is *correct* about GPM gradient projection but *targets the wrong mechanism*: the orthogonality comes from a different step, the InfLoRA A-projection, which is a much harder constraint.
-**Proposition 2** *(InfLoRA Ensures the Condition of Theorem 1).* With $P_{\text{old}} = \mathcal{B}\mathcal{B}^T$ the GPM projection matrix (built from tasks $1,\ldots,t-1$), the InfLoRA step projects **all rows of $A_t$ onto the null-space of $P_{\text{old}}$**:
-$$A_t \leftarrow A_t(I - P_{\text{old}}) \quad\Rightarrow\quad \text{rowspace}(A_t) \subseteq \text{null}(P_{\text{old}})$$
-Then:
-$$\text{span}(V_t) \;=\; \text{rowspace}(\Delta W_t) \;\subseteq\; \text{rowspace}(A_t) \;\subseteq\; \text{null}(P_{\text{old}})$$
-**(Step-by-step proof.)**
-- $\text{rowspace}(B_t A_t) \subseteq \text{rowspace}(A_t)$: holds for every $B_t$ (left multiplication cannot enlarge the row space).
-- $\text{rowspace}(A_t) \subseteq \text{null}(P_{\text{old}})$: by the InfLoRA projection step above.
-- The GPM bases $\mathcal{B}$ approximately span $\text{rowspace}(A_s)$ for tasks $s < t$ (GPM accumulates principal input directions, and the activations of task $s$ are excited mainly along the directions of $A_s$).
-- Therefore: $\text{span}(V_t) \subseteq \text{null}(P_{\text{old}}) \approx \perp \text{span}(V_s)$ for all $s < t$. $\square$
-**Approximation quality:** With GPM threshold $\varepsilon_0 = 0.995$ (capturing ≥ 99.5% of the variance), $\delta_{t,s} \leq 0.005 \ll \kappa_{\min}(t^*)$ in practice.
-**GPM sensitivity analysis (Davis–Kahan perturbation).** In practice GPM captures the principal directions only *approximately*, so it carries an error $\Delta P_{t-1}$ relative to the ideal GPM. By the Davis–Kahan theorem, the perturbation of the subspace angle is bounded by:
-$$\sin\!\bigl(\Theta(\hat{V}_s, V_s)\bigr) \leq \frac{\|\hat{C}_s - C_s\|_2}{\delta_{\text{gap}}(C_s)}$$
-where $\hat{C}_s$ is the sample covariance (200 batches) and $\delta_{\text{gap}}$ is the eigenvalue gap. With batch size 200 and T5-small ($d=512$), this error is small when $\delta_{\text{gap}}$ is large enough (divergent task distributions). Practical consequence: the margin in Theorem 1 shrinks by an additional $O(\|\Delta P\|_F / \delta_{\text{gap}})$, which is small for tasks with divergent spectra and larger for same-domain tasks (as expected). Same-domain routing failure remains a limitation of *every* zero-replay CL method, not something specific to SpecRoute.
-### 3.5 Differential Projection Lemma
-**Lemma 1** *(Differential Projection, exact).* For $A_t$ satisfying the InfLoRA constraint $A_t P_{t-1} = 0$, for **every** $h \in \mathbb{R}^d$:
-$$\|A_t h\|^2 \;=\; \|A_t Q_{t-1} h\|^2, \quad Q_{t-1} = I - P_{t-1}$$
-**Proof.** Write $h = P_{t-1} h + Q_{t-1} h$. Since $A_t P_{t-1} = 0$ (the rows of $A_t$ are orthogonal to the column space of $P_{t-1}$):
-$$A_t h = A_t P_{t-1} h + A_t Q_{t-1} h = 0 + A_t Q_{t-1} h \qquad \square$$
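Lemma 1 is easy to confirm numerically. The sketch below is an illustration under assumed sizes, using NumPy and a random orthonormal basis as a stand-in for the stored GPM bases; it is not the repository's projection code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k_old, r = 32, 10, 4                     # illustrative sizes (assumption)

# Stand-in for the GPM bases of old tasks: a random orthonormal d x k_old basis
basis, _ = np.linalg.qr(rng.normal(size=(d, k_old)))
P = basis @ basis.T                         # projector onto the protected span
Q = np.eye(d) - P                           # null-space projector

# InfLoRA-style projection of a random A so that A P = 0
A = rng.normal(size=(r, d)) @ Q

h = rng.normal(size=d)
lhs = np.linalg.norm(A @ h) ** 2
rhs = np.linalg.norm(A @ Q @ h) ** 2

# An input lying entirely inside the protected span produces no response
h_old = P @ rng.normal(size=d)
leak = np.linalg.norm(A @ h_old) ** 2
```

Here `lhs` and `rhs` coincide up to rounding, and `leak` is numerically zero, which is the differential-projection effect described in §2.4.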
-**Corollary A (current expert on old data, from Lemma 1):** For $h \sim p_s$ ($s < t$):
-$$E_{h \sim p_s}\!\left[\alpha_t(h)\right] = \frac{E[\|A_t Q_{t-1} h\|^2]}{r\,\|h\|^2} \leq \frac{\mathrm{tr}(Q_{t-1} C_s)}{r} \leq \frac{(1-\tau_\text{GPM})\,\mathrm{tr}(C_s)}{r} \leq \frac{0.005\,\mathrm{tr}(C_s)}{r}$$
-**Corollary B (old expert on new data, GPM-capture argument):** For $h \sim p_t$ ($t > s$): the GPM built after task $s$ accumulates the principal directions of task $s$'s activations, so the row space of $A_s$ (the top-$r$ directions of $\tilde{C}_s$) is captured in $P_s \subseteq P_{t-1}$. Therefore:
-$$\text{rowspace}(A_s) \subseteq \text{range}(P_{t-1}) \quad\Rightarrow\quad A_s Q_{t-1} = 0$$
-It follows that $A_s h_t = A_s P_{t-1} h_t$ and:
-$$E_{h \sim p_t}[\alpha_s(h)] = \frac{E[\|A_s P_{t-1} h_t\|^2]}{r\,\|h_t\|^2} \leq \frac{\mathrm{tr}(P_{t-1} C_t)}{r\,\mathrm{tr}(C_t)} \cdot \frac{\mathrm{tr}(C_t)}{r}$$
-For a task $t$ from a new domain (different from the old tasks): $\mathrm{tr}(P_{t-1} C_t) / \mathrm{tr}(C_t) = \text{PEV}_{t,\text{old}} \ll 1$, the fraction of task $t$'s variance explained by the old bases. For same-domain tasks $\text{PEV}_{t,\text{old}}$ is larger; this is a fundamental limitation of every zero-replay CL method, not a gap specific to SpecRoute.
----
-### 3.6 C5 Routing Optimality Theorem (Main Theoretical Contribution)
-**Definition 4** *(Restricted Stiefel Manifold).*
-$$\mathcal{A}_t = \{A \in \mathbb{R}^{r \times d} : A P_{t-1} = 0,\; A A^\top = I_r\}$$
-**Theorem 2** *(C5 Routing Optimality).* With $C_t = E_{h \sim p_t}[hh^\top]$ and $\tilde{C}_t = Q_{t-1} C_t Q_{t-1}$:
-$$\operatorname{argmax}_{A_t \in \mathcal{A}_t} E_{h \sim p_t}[\alpha_t(h)] \;=\; \text{top-}r\text{ eigenvectors of } \tilde{C}_t$$
-Maximum value: $\displaystyle\frac{1}{r}\sum_{i=1}^r \lambda_i(\tilde{C}_t)$
-**Proof.** From Lemma 1:
-$$E_{h \sim p_t}[\alpha_t(h)] = \frac{E[\|A_t Q_{t-1} h\|^2]}{r\,E[\|h\|^2]} = \frac{\mathrm{tr}(A_t\,\tilde{C}_t\,A_t^\top)}{r}$$
-Under the constraint $A_t A_t^\top = I_r$, this is standard **constrained PCA** on $\tilde{C}_t$: the solution is given by the eigenvectors associated with the largest eigenvalues. This is exactly C5. $\square$
-**Implication:** C5 **simultaneously** optimizes:
-1. **Learning quality:** it maximizes $\mathrm{tr}(A_t \tilde{C}_t A_t^\top)$, the variance captured in the null-space → $B_t$ learns effectively.
-2. **Routing signal:** it maximizes $E[\alpha_t(h)]$ → routing separates task $t$ better than any other init.
-*This is why C5 and C2 are inseparable: C5 turns a random routing key into an optimal routing key.*
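The optimality claim of Theorem 2 can be illustrated on synthetic data. The sketch below is a toy check, assuming NumPy, a random stand-in for the GPM bases, and an artificial anisotropic covariance; it compares the variance captured by the top-$r$ eigenvectors of $\tilde{C}_t$ against a random feasible point of $\mathcal{A}_t$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k_old, r = 32, 10, 4                     # illustrative sizes (assumption)

basis, _ = np.linalg.qr(rng.normal(size=(d, k_old)))   # stand-in GPM bases
Q = np.eye(d) - basis @ basis.T

# Synthetic anisotropic activations for the current task
X = rng.normal(size=(500, d)) * np.linspace(3.0, 0.1, d)
C = X.T @ X / len(X)
C_tilde = Q @ C @ Q                          # projected covariance

eigvals, eigvecs = np.linalg.eigh(C_tilde)   # eigenvalues in ascending order
A_c5 = eigvecs[:, -r:].T                     # top-r eigenvectors as rows

def captured(A: np.ndarray) -> float:
    """tr(A C_tilde A^T): variance captured inside the null-space."""
    return float(np.trace(A @ C_tilde @ A.T))

# Random feasible point: orthonormal rows inside the null-space
G = Q @ rng.normal(size=(d, r))
A_rand = np.linalg.qr(G)[0].T

print(f"C5 init: {captured(A_c5):.3f}   random init: {captured(A_rand):.3f}")
```

Both matrices satisfy the constraints of Definition 4; the data-informed one attains the sum of the top-$r$ eigenvalues, which no other feasible choice can exceed.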
 ---
-### 3.7 Routing Margin Theorem with GPM + C5
-**Theorem 3** *(Explicit Routing Margin).* Let $\lambda_t^\min = \lambda_r(\tilde{C}_t)$ (the $r$-th eigenvalue of the projected covariance). With C5 init and A-row routing ($\tau_\text{GPM} = 0.995$):
-$$\boxed{E_{h \sim p_t}[\alpha_t(h)] - \max_{s < t}\,E_{h \sim p_s}[\alpha_t(h)] \;\geq\; \frac{\lambda_t^\min}{r} - \frac{0.005\,\bar{\sigma}^2}{r}}$$
-with $\bar{\sigma}^2 = \max_s \mathrm{tr}(C_s)$.
-**Corollary (GPM–Routing Paradox, formalized):** With a random $A_t$ (no C5), the routing signal is:
-$$E_{h \sim p_t}[\alpha_t^\text{rand}(h)] \approx \frac{\mathrm{tr}(\tilde{C}_t)}{d'}$$
-Advantage ratio of C5 over random:
-$$\frac{E[\alpha_t^\text{C5}(h)]}{E[\alpha_t^\text{rand}(h)]} \;=\; \underbrace{\frac{d'}{r}}_{\text{null-space factor}} \cdot \underbrace{\text{PEV}_r(\tilde{C}_t)}_{\text{task concentration}}$$
-**Key observation:** The factor $d'/r$ and $\text{PEV}_r$ both **matter more for routing as the null-space shrinks** (later tasks). C5 matters most precisely when routing is hardest.
-For T5-small at task 8 ($d' \approx 351$, $r=8$): the ratio is $\approx 44\times \cdot \text{PEV}_8 \gg 1$.
 ---
-### 3.8 Training–Inference Routing Split
-**Proposition 2** *(Two-Phase Routing).* SpecRoute uses two different routing mechanisms depending on the phase:
-| Phase | Mechanism | Rationale |
-|-------|---------|-------|
-| Training (task $t$) | **Oracle: always route 100% to the current task** | The task ID is always known during training; spectral routing would kill the gradient because of the GPM–Routing paradox |
-| Inference | **Hard Top-1 spectral routing** (with calibration normalization) | No task ID is available; routing follows automatically from the calibrated A-row affinity |
-**Why training needs oracle routing:**
-The GPM–Routing Paradox (§2.3) forces $A_t \perp h_t$ in the following sense: $A_t$ lies in the null-space of $P_{\text{old}}$, whereas $h_t$ has most of its energy on $\text{span}(P_{\text{old}})$ for tasks that share a domain with old tasks. Consequently $\|A_t h_t\|^2 \approx 0$ → the fit score of the current task is nearly 0 → the spectral argmax picks an old task → $B_t$ receives no gradient → **it never learns**.
-Oracle routing during training is standard practice in CL: every task-specific CL method (GainLoRA, O-LoRA) uses the task ID during training (GainLoRA trains its routing MLP with a CE loss on the task label). Oracle routing is not cheating: under the CL protocol the task ID is always available at training time.
-**Calibration Normalization (FIX 3) at inference:**
-A-row fit scores have different scales across tasks (the first task has $A_t$ in the full $d$-dimensional space, while later tasks are constrained to a smaller null-space). To make the argmax comparison fair, we normalize each task's score by the EMA fit scale collected during training:
-$$\alpha_t^{\text{cal}}(h) = \frac{\alpha_t(h)}{\hat{\mu}_t}, \qquad \hat{\mu}_t = \text{EMA}\!\left[\frac{\|A_t h\|^2}{r\|h\|^2}\right]_{\text{training data of } t}$$
-Inference routing: $t^* = \arg\max_t \alpha_t^{\text{cal}}(h)$ (hard Top-1).
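The calibrated hard Top-1 rule can be sketched as follows. This is a toy illustration under stated assumptions: the class name `CalibratedRouter`, the EMA momentum, and the hand-built experts and inputs are all hypothetical and chosen so that the two experts have very different raw score scales.

```python
import numpy as np

def a_row_fit(A: np.ndarray, h: np.ndarray) -> float:
    """Raw A-row fit score ||A h||^2 / (r ||h||^2)."""
    return float(np.linalg.norm(A @ h) ** 2 / (A.shape[0] * (h @ h)))

class CalibratedRouter:
    """Hard Top-1 routing on fit scores divided by a per-task EMA scale."""

    def __init__(self, momentum: float = 0.9):
        self.momentum = momentum
        self.experts, self.scales = [], []

    def add_expert(self, A, train_inputs):
        mu = None
        for h in train_inputs:               # EMA of the fit on the task's own data
            fit = a_row_fit(A, h)
            mu = fit if mu is None else self.momentum * mu + (1 - self.momentum) * fit
        self.experts.append(A)
        self.scales.append(mu)

    def route(self, h) -> int:
        scores = [a_row_fit(A, h) / mu for A, mu in zip(self.experts, self.scales)]
        return int(np.argmax(scores))

# Two toy experts on disjoint coordinates; expert 1 has a 10x smaller scale
A0 = np.zeros((2, 8)); A0[0, 0] = A0[1, 1] = 1.0
A1 = np.zeros((2, 8)); A1[0, 4] = A1[1, 5] = 0.1
h0 = np.array([1.0, 1.0, 0, 0, 0.3, 0.3, 0, 0])   # task-0-like input
h1 = np.array([0.3, 0.3, 0, 0, 1.0, 1.0, 0, 0])   # task-1-like input

router = CalibratedRouter()
router.add_expert(A0, [h0, 2 * h0, 0.5 * h0])
router.add_expert(A1, [h1, 2 * h1, 0.5 * h1])

raw_choice = int(np.argmax([a_row_fit(A0, h1), a_row_fit(A1, h1)]))
calibrated_choice = router.route(h1)
```

In this toy setup the uncalibrated argmax sends the task-1 input to expert 0 purely because of the scale mismatch, while the calibrated score selects expert 1.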
-**α-sufficiency condition (for correct inference routing):**
-Expert $t$ wins routing at inference if $\alpha_t^{\text{cal}}(h_t) > \alpha_s^{\text{cal}}(h_t)$ for all $s \neq t$. This depends on:
-- Scale consistency: $\hat{\mu}_t$ estimates $E[\alpha_t | h \sim p_t]$ well → the calibrated score is stable.
-- C5 advantage: $\alpha_t$ on task-$t$ data is higher than $\alpha_s$ (with $s$ a task from a different domain) → expert $t$ still wins after calibration.
-- Same-domain limit: same-domain tasks have $\alpha_t \approx \alpha_s$ even after calibration; this structural limitation is discussed in §3.10.
----
-### 3.9 Drift Invariance
-**Proposition 3** *(Drift-Free Routing).* The routing function $h \mapsto \alpha_t(h)$ is invariant across all tasks: $h$ comes from the frozen embedding table, before any attention layer, and $A_t$ is frozen after C5 init. $\square$
-**Clarification on layer routing:** Routing is computed *once* at the input (token embeddings, before the first transformer block), not separately at each transformer layer. The vector $h$ is the mean-pool of the token embeddings (frozen `embed_tokens`), which does not change during training. This keeps routing fully frozen and drift-free. Per-layer C5 initialization (each attention layer has its own $A_t^{(l)}$) serves *learning quality* rather than routing; routing only uses the encoder layers to aggregate the signal.
----
-### 3.10 The Null-Space Collapse Problem (Open, Partially Addressed)
-Theorem 1 assumes $h \in \text{span}(V_{t^*})$. This condition requires:
-- **(A) The expert must be able to learn:** $A_t$ must align with task-relevant directions. C5 addresses this with data-informed init.
-- **(B) The input must have a projection:** real inputs must carry energy on $\text{span}(V_t)$.
-C5 addresses (A). (B) holds as long as the null-space ($d'$) remains wide enough to capture the variance of task $t$. With $d=512$, $r=8$, and 15 tasks, the null-space is still sufficient according to Corollary 2.
-The null-space (after $t-1$ tasks) has dimension $d - N_{\text{protected}}$. A Kaiming random draw in this space does NOT GUARANTEE alignment with the directions relevant to task $t$. As the null-space shrinks (Layer 7: 8/512 → 161/512 → 344/512 over 13 tasks), the probability that a random init hits the right task-relevant directions decreases accordingly.
-**Empirical consequence (V6):** IMDB (task 8): eval_loss stalls at 6.37 after 10 epochs and EM=0.0 throughout; the expert learns nothing useful.
----
-## 4. Framework Components
-### C1: Spectral Expert Signatures (V8: A_t as Signature)
-**V8 change from V7:** The signature is $\mathcal{S}_t = A_t$ (a model parameter); **no post-training thin SVD is needed**. Rationale from Theorem 2: the span of $V_t$ (from the SVD of $B_tA_t$) equals the row space of $A_t$ when $B_t$ has full column rank (left multiplication cannot enlarge the row space), so the SVD only adds a $\sigma$-weighting that introduces noise. With C5 init, the rows of $A_t$ **already are** task-discriminative directions.
-- **No `prepare_inference_routing()` needed**: this removes the $O(dr^2)$ overhead per task per layer.
-- **No additional parameters**: $A_t$ is an existing model parameter.
-- **Invariant**: $A_t$ is frozen after C5 init (Proposition 3).
-### C2: Data-Informed Differential Routing (V8)
-**V8 change from V7:** Both training and inference use the **A-row formula**, which removes `prepare_inference_routing()` and the SVD inference mismatch entirely. Rationale from Lemma 1 + Theorem 2: with C5 init the rows of $A_t$ are already the routing-optimal directions; the SVD of $B_tA_t$ only adds a $\sigma^2$-weighting that is an artifact of optimizing $B$ and carries no theoretical guarantee.
-**Routing formula (inference):**
-$$t^* = \arg\max_t \;\alpha_t^{\text{cal}}(h), \qquad \alpha_t^{\text{cal}}(h) = \frac{\|A_t h\|^2/\bigl(r\,\|h\|^2\bigr)}{\hat{\mu}_t}$$
-where $\hat{\mu}_t$ is the EMA fit scale collected while training task $t$ (Calibration Normalization, §3.8).
-**Training** (task $t$): oracle routing; the current task always receives weight=1.0 (§3.8).
-**Rationale for A-row routing (from Lemma 1 + Theorem 2):**
-- **Exact decomposition (Lemma 1):** $\|A_t h\|^2 = \|A_t Q_{t-1} h\|^2$, so routing sees only the null-space component and is unaffected by old tasks.
-- **Optimality with C5 (Theorem 2):** C5 init picks $A_t$ = argmax $E[\|A_t h\|^2]$ over all $A_t \in \mathcal{A}_t$, so the A-row affinity is the best achievable under the constraint.
-- **Margin bound (Theorem 3):** $E[\alpha_t|h \sim p_t] - \max_s E[\alpha_t|h \sim p_s] \geq \lambda_t^\min/r - 0.005\bar{\sigma}^2/r > 0$.
-- **Removing `prepare_inference_routing()`:** $A_t$ is frozen after C5 init, so it serves as the signature and the SVD of $B_tA_t$ never needs to be recomputed.
-**Rationale for adaptive $\beta(n)$:** Solving $w_t = \alpha_{\mathrm{target}}$ in the softmax gives the closed form $\beta(n) = \tau\ln(\alpha_\text{target} \cdot n/(1-\alpha_\text{target}))$. This avoids the $O(1/n)$ softmax dilution as the number of tasks grows.
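The closed form for $\beta(n)$ can be checked directly. The sketch below rests on an interpretation that the text does not spell out: $\beta$ is a logit bias added to the current expert, and $n$ counts the competing experts, all sitting at logit 0. Function names are hypothetical.

```python
import math

def adaptive_beta(n: int, alpha_target: float, tau: float) -> float:
    """beta(n) = tau * ln(alpha_target * n / (1 - alpha_target))."""
    return tau * math.log(alpha_target * n / (1.0 - alpha_target))

def current_expert_weight(beta: float, n: int, tau: float) -> float:
    """Softmax weight of an expert with logit beta against n experts at logit 0."""
    e = math.exp(beta / tau)
    return e / (e + n)

tau, alpha_target = 0.5, 0.8        # illustrative values (assumption)
weights = {
    n: current_expert_weight(adaptive_beta(n, alpha_target, tau), n, tau)
    for n in (1, 5, 14)
}
unbiased = {n: current_expert_weight(0.0, n, tau) for n in (1, 5, 14)}
```

Under this reading the biased weight stays at the target for every $n$, whereas the unbiased weight decays as $1/(n+1)$, which is the dilution the text refers to.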
-**Advantage of C5 over random A-row (corollary of Theorem 3):**
-$$\frac{E[\alpha_t^\text{C5}(h)]}{E[\alpha_t^\text{rand}(h)]} = \frac{d'}{r} \cdot \text{PEV}_r(\tilde{C}_t) \approx 44\times \text{ at task 8 (T5-small)}$$
-| Phase | Routing mechanism | Notes |
-|-------|----------------|-------|
-| Training (task $t$) | **Oracle: weight=1.0 for the current task** | Prevents the GPM–Routing paradox from killing the gradient |
-| **Inference (all tasks)** | **Hard Top-1 calibrated A-row argmax** | **Calibration normalizes the scale; no SVD** |
-| Theoretical guarantee | Theorem 3 margin bound + C5 advantage $44\times$ | Inference routing uses the calibrated affinity |
-### C3: Capacity-Aware Subspace Allocation
-The GPM threshold controls the protection–capacity trade-off. From Theorem 1:
-- Lower $\varepsilon$ → better protection and routing, but the null-space is exhausted faster.
-- Higher $\varepsilon$ → more capacity, but weaker routing guarantees.
-**Dynamic threshold** (following InfLoRA):
-$$\varepsilon_t = (1 - \varepsilon_0) \cdot \frac{t}{T} + \varepsilon_0$$
-where $\varepsilon_0$ is the base threshold. Protection is allocated more strictly as tasks accumulate. The trade-off is *principled* via Corollary 2: as long as $\varepsilon_t$ exceeds $(1 - d/(rT))$, capacity for all $T$ tasks is guaranteed.
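A short sketch of the schedule, using $\varepsilon_0 = 0.995$ and $T = 15$ as quoted elsewhere in the document (the function name is illustrative):

```python
def dynamic_threshold(t: int, total_tasks: int, eps0: float) -> float:
    """eps_t = (1 - eps0) * t / T + eps0, rising linearly from eps0 towards 1."""
    return (1.0 - eps0) * t / total_tasks + eps0

T, eps0 = 15, 0.995
schedule = [dynamic_threshold(t, T, eps0) for t in range(1, T + 1)]

for t in (1, 8, 15):
    print(f"task {t:2d}: eps_t = {schedule[t - 1]:.6f}")
```

The threshold increases strictly with $t$ and reaches 1 at the last task, so later tasks are protected more strictly.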
----
-### C4: Spectrally-Conditioned Gradient (Implementation Detail)
-> **Classification note:** C4 is an implementation detail, not an independent theoretical contribution. It addresses a purely technical issue: after `get_reg_matrix()` projects $A_t$ onto the null-space, the rows of $A_t$ are no longer orthonormal, which distorts the gradient $\nabla_B \mathcal{L} = \nabla_{\Delta W} \mathcal{L} \cdot A^T$. Applying a preconditioner is a necessary technical correction, not a new academic claim.
-The gradient $\nabla_B \mathcal{L}$ is distorted by the condition number of $A^T$. We apply a preconditioner once after $A$ is frozen:
-$$\tilde{\nabla}_B = \nabla_B \mathcal{L} \cdot (AA^T + \epsilon I)^{-1/2}$$
-The preconditioner is computed **once** after `get_reg_matrix()`, so there is no per-step overhead.
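One way to compute $(AA^T + \epsilon I)^{-1/2}$ is through a symmetric eigendecomposition. The sketch below uses NumPy and made-up shapes; it is an illustration of the formula, not the repository's implementation, and the function name is hypothetical.

```python
import numpy as np

def inv_sqrt_preconditioner(A: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Return (A A^T + eps I)^(-1/2) via a symmetric eigendecomposition."""
    gram = A @ A.T + eps * np.eye(A.shape[0])
    w, U = np.linalg.eigh(gram)              # gram is symmetric positive definite
    return U @ np.diag(w ** -0.5) @ U.T

rng = np.random.default_rng(0)
r, d, d_out = 4, 32, 16                      # illustrative sizes (assumption)

# Rows with very different norms, i.e. a badly conditioned A
A = rng.normal(size=(r, d)) * np.array([[5.0], [1.0], [0.5], [0.1]])
S = inv_sqrt_preconditioner(A)               # computed once, after A is frozen

grad_B = rng.normal(size=(d_out, r))         # stand-in for the gradient w.r.t. B
grad_B_tilde = grad_B @ S                    # preconditioned gradient

A_white = S @ A                              # rows are approximately orthonormal
cond_before = np.linalg.cond(A @ A.T)
cond_after = np.linalg.cond(A_white @ A_white.T)
```

Because `S` depends only on the frozen `A`, it can be cached and reused at every optimizer step.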
-> **Note:** Spectral entropy regularization (C4.2) is removed in V7. Reason: C5 (Data-Informed Init) initializes $A_t$ so that $B_t$ learns in a task-relevant subspace → the singular values spread naturally according to the data. Forcing uniform entropy through regularization contradicts the philosophy of C5 (let the data, not a regularizer, drive the spectral distribution). The gradient preconditioner (C4.1) is kept because it fixes the matrix conditioning and does not affect the data-driven philosophy.
----
-### C5: Data-Informed Subspace Initialization (Main Contribution)
-#### Motivation
-When $A_t$ is initialized randomly and projected onto the null-space, it occupies an *arbitrary* point on the restricted Grassmannian $\mathrm{Gr}(r,\, d - N_{\text{protected}})$. With $\dim\bigl(\mathrm{Gr}(8, 351)\bigr) = 8 \times 343 = 2744$, the space of choices is very large, so a random init is almost surely sub-optimal. In particular, as the null-space shrinks, the task-relevant directions make up an ever smaller share of the remaining space, which makes random init even less effective.
-#### Optimization problem
-We pose the initialization of $A_t$ as a constrained optimization problem:
-$$\max_{A_t} \quad \text{tr}\!\bigl(A_t\, Q\, C_t\, Q\, A_t^T\bigr) \quad \text{s.t.} \quad A_t A_t^T = I_r$$
-where $Q = I - P_{\text{old}}$ is the null-space projector (with $P_{\text{old}} = \mathcal{B}\mathcal{B}^T$ the GPM projection matrix), and:
-$$C_t = \frac{1}{|\mathcal{X}_t|} \sum_{x \in \mathcal{X}_t} h(x)\, h(x)^T$$
-is the activation covariance of task $t$, estimated from the first few batches of the training data.
-**Meaning:** Maximize the variance captured in the null-space under the data distribution of task $t$, i.e. find the $r$-dimensional subspace of the null-space that *best fits* the current task's data.
-#### Closed-form solution
-Define the **projected covariance**: $\tilde{C}_t = Q\, C_t\, Q$.
-The problem becomes standard constrained PCA on $\tilde{C}_t$. The exact solution is:
-$$A_t = \text{top-}r\text{ eigenvectors of } \tilde{C}_t$$
-or equivalently, the rows of $A_t$ are the $r$ eigenvectors associated with the largest eigenvalues of $\tilde{C}_t = Q C_t Q$.
-**Algorithm** (constrained PCA in the null-space):
-```
-# Step 1: collect the activation covariance (small forward pass, before training)
-C_t = ∑ h(x)h(x)^T / N_batch    # input covariance of task t (N_batch ~100 batches)
-# Step 2: project the covariance onto the null-space
-Q = I - P_old                   # null-space projector (from the stored GPM bases)
-C_tilde = Q @ C_t @ Q           # projected covariance
-# Step 3: eigendecomposition
-eigvals, eigvecs = eigh(C_tilde) # symmetric input → eigh is faster than SVD; eigenvalues come back in ascending order
-# Step 4: fallback if the signal is too weak (degenerate null-space)
-if eigvals[-1] < 1e-6:          # eigvals[-1] is the largest eigenvalue
-    # The null-space is saturated or the task has no clear activation signal:
-    # revert to Kaiming random init + InfLoRA projection as in the original
-    continue                    # skip C5 for this layer (assumes an enclosing per-layer loop)
-top_r_idx = argsort(eigvals, descending=True)[:r]
-# Step 5: set A_t
-A_t = eigvecs[:, top_r_idx].T   # shape (r, d): the most task-relevant directions in the null-space
-A_t = A_t / norm(A_t, dim=1, keepdim=True) * sqrt(3)  # normalize as in the original InfLoRA
-```
-**Fallback condition:** If `max_eigenvalue(C_tilde) < 1e-6`, the null-space is too narrow or the activations carry no sufficiently strong signal. In this case C5 yields to the standard Kaiming init + InfLoRA projection: this is no worse than V6, it simply brings no improvement. The condition only occurs when the null-space is nearly saturated, i.e. when ESA has consumed almost all of the capacity.
-**Per-layer C5:** Each LoRA layer (encoder Q, V; decoder self/cross Q, V) has its own $C_t$, collected from that layer's activations. GPM also stores a separate $P_{\text{old}}^{(l)}$ for each layer $l$. The eigendecomposition is therefore performed independently per layer: each $A_t^{(l)}$ captures only the task-relevant variance in the null-space of layer $l$.
-**Connection to routing:** Routing uses the input embedding (the output of the frozen embedding table, before all transformer layers) and the A-rows of the encoder layers, not a separate routing per layer. Per-layer C5 improves the **learning efficiency** of $B_t$ at each layer, while the routing signal (§3.2) is aggregated over the encoder layers.
-#### Information-Theoretic Meaning
-By the Data Processing Inequality, for any matrix $A_t$:
-$$I(A_t h;\, y) \leq I(h;\, y)$$
-But under the null-space constraint, not all $A_t$ are equal. A data-informed $A_t$ **maximizes** $I(A_t h; y)$ within the class of $A_t$ that satisfy the null-space constraint, whereas a random $A_t$ captures only a random, unoptimized part.
-Moreover, when $A_t$ is better initialized, $B_t$ trains in a task-relevant subspace → larger $\sigma_{t,i}$ → a stronger spectral signature $\mathcal{S}_t$ → the routing margin $\kappa_{\min}(t)$ in Theorem 1 **increases**. This is a direct link from C5 back to the routing theory.
-#### Zero-Replay Compatibility
-$C_t$ is computed from the **training data of the current task** (the task $t$ being trained). This is not replay (replay = reusing *old* data). The training data of the current task is always available in the CL setting. $A_t$ (a model parameter) encodes only *directions* (not specific data values or positions), just like the GPM bases, which are also computed from activation covariances and are accepted in InfLoRA and GainLoRA. ✅ zero-replay compliant.
-#### Connection to the Original Problem
-| V6 failure mode | Root cause | How C5 addresses it |
-|-----------------|-----------|---------------|
-| IMDB/SST2 EM=0 (never-learning) | A random $A_t$ misses the task-relevant directions in the null-space | A data-informed $A_t$ captures the highest variance in the null-space → $B_t$ CAN learn |
-| Routing degradation (yelp 55→36) | Low expert quality → $\sigma \approx 0$ → signature = noise → random routing | Higher expert quality → $\sigma$ substantially $> 0$ → discriminative routing |
 ---
-## 5. What Is Removed from GainLoRA
-| Component | GainLoRA | SpecRoute | Reason |
-|------------|----------|-----------|-------|
-| MLP `trans_input` | Learned routing projection | ❌ Removed | Duality: spectral affinity suffices |
-| `prompt_key` | Learned per-task key | ❌ Removed | Replaced by spectral signatures |
-| `previous_trans_input` | Frozen MLP copies | ❌ Removed | Signatures are invariant by construction |
-| KL distillation | Replay-based routing loss | ❌ Removed | No learned routing → nothing to distill |
-| GPM on routing params | Subspace for routing | ❌ Removed | No routing parameters to protect |
-| **`prepare_inference_routing()`** | **SVD of $B_tA_t$ post-training** | **❌ Removed (V8)** | **$A_t$ is the signature, so no SVD recomputation is needed; removes the $O(dr^2)$ overhead** |
-| **SVD $\sigma^2$-weighted inference** | **Rayleigh quotient differs from the training formula** | **❌ Removed (V8)** | **Train-inference mismatch; A-rows suffice by Lemma 1 + Theorem 2** |
-**Overall effect:** All the subspace and compute budget that GainLoRA spends on routing infrastructure is *reclaimed* for task learning.
----
-## 6. Two Core Contributions
-> **Structural principle:** The two contributions form a closed logical loop: part 1 builds the theory, and part 2 removes the practical obstacle that keeps that theory from working. Neither part can stand on its own without the other.
----
-### Contribution 1: A Non-Parametric Spectral Routing Framework
-> *Synthesized from C1, C2, C3: one unified argument, not three separate tricks.*
-**Central problem:** Earlier CL methods (GainLoRA, O-LoRA) treat routing and protection as two independent problems, which leads to a vicious cycle: routing parameters drift → GPM must protect routing → the subspace left for task learning shrinks → routing weakens further.
-**Contribution:** We formalize and prove that **orthogonal subspace protection and discriminative routing are dual manifestations of the same spectral structure** (Theorem 1). From this duality we derive a fully non-parametric routing mechanism: each input is routed to the expert whose characteristic subspace has the highest spectral affinity with it, with no learned parameters, no replay, and no GPM overhead for routing infrastructure.
-**Theoretical guarantees:**
-- Routing margin $\geq \kappa_{\min}(t^*) - \varepsilon\, \kappa_{\max}$, which improves with protection quality (Theorem 1).
-- Routing weight $w_{t^*} \geq 1-\delta$ for temperature $\tau \leq m/\ln\!\bigl((T-1)/\delta\bigr)$ (Corollary 1).
-- Capacity bound $T_{\max} \leq d/(\bar{k}(1-\varepsilon))$ via Grassmannian packing theory (Corollary 2), where $\bar{k}$ is the GPM effective rank ($\approx 30$–$80$ dims/task), not the LoRA rank $r=8$.
-- Routing is fully invariant over time: $h$ comes from the frozen embedding table, and $\mathcal{S}_t$ is frozen after training (Proposition 1).
-**Why this is a theoretical advance rather than a simplification:** The result shows that architectures such as GainLoRA are solving a problem that need not exist (learning routing parameters). We prove that good protection ↔ good routing: the two problems turn out to be *one*.
----
-### Contribution 2: Data-Driven Subspace Optimization
-> *Synthesized from C5: removes the practical obstacle that makes Contribution 1 collapse at runtime.*
-**Central problem:** Contribution 1 requires $h \in \mathrm{span}(V_{t^*})$, i.e. expert $t^*$ must have learned meaningful transformations from the task data. This depends on the quality of $A_t$. InfLoRA (and GainLoRA) initialize $A_t$ randomly and then project it into the null-space: an *arbitrary* point on the Grassmannian $\mathrm{Gr}(r, d - N_{\text{protected}})$, whose dimension $r(d - N_{\text{protected}} - r)$ is very large. As the null-space shrinks with more tasks, the probability of randomly hitting the task-relevant directions tends to 0.
-**Contribution:** We cast the initialization of $A_t$ as a **Constrained PCA problem on the restricted Grassmannian** and give a closed-form solution:
-$$\max_{A_t} \;\text{tr}(A_t\, Q C_t Q\, A_t^T) \quad \text{s.t.} \quad A_t A_t^T = I_r \;\Rightarrow\; A_t = \text{top-}r\text{ eigenvectors of } QC_tQ$$
-This $A_t$ captures the **maximum task-relevant variance** in the available null-space, turning a random problem into a deterministic one. It is zero-replay compliant because $C_t$ is computed from *current* (not old) task data, the same logic as the GPM bases already accepted in InfLoRA.
-**Closed loop with Contribution 1:** Better $A_t$ → $B_t$ learns in a task-relevant subspace → larger $\sigma_{t,i}$ after training → $\kappa_{\min}(t^*)$ in Theorem 1 increases → routing margin increases → Contribution 1 behaves as the theory predicts.
----
-> **C4 (gradient preconditioning)** is an engineering detail needed to fix matrix conditioning after projection, not a theoretical contribution. See §4 for details.
----
-## 7. Training Pipeline
-### Task 1 (`--run_single True`)
-1. Load pretrained model + fresh LoRA ($A$: Kaiming, $B$: zeros).
-2. Standard training (`lora_B` only): single expert, no routing.
-3. After training: update the GPM bases (ESA threshold). **No SVD or `prepare_inference_routing()` is needed**, because $A_t$ is the signature.
-4. Save: LoRA weights, GPM reg files.
-### Task $t \geq 2$
-1. Load model + fresh LoRA; load the old LoRA weights.
-2. **[C5, NEW]** Pre-task forward pass (200 batches, no grad):
-   - Collect the activation covariance $C_t$ of task $t$'s inputs
-   - Compute the projected covariance: $\tilde{C}_t = Q C_t Q$ ($Q = I - P_{\text{old}}$)
-   - Eigenvectors of $\tilde{C}_t$ → initialize $A_t$ (replacing random Kaiming)
-3. InfLoRA: normalize $A_t$ (already in the null-space thanks to the eigendecomposition).
-4. Train `lora_B` with **oracle routing** (current task weight=1.0) + gradient preconditioning (C4.1). The EMA fit scale for calibration normalization is collected automatically during training.
-5. After training: update the GPM bases (200 batches). **Do not call `prepare_inference_routing()`**: the $A_t$ frozen in step 2 is a sufficient signature.
-6. Save all artifacts for the next task.
----
-## 8. Theory–Implementation Mapping
-| Theory | Implementation | File |
-|-----------|---------------|------|
-| Spectral signature $\mathcal{S}_t = A_t$ | `lora_A` weights (frozen after C5 init), **no SVD needed** | `t5_specroute.py` |
-| Spectral affinity $\alpha_t(h)$ (inference) | Calibrated A-row fit: `fit / E_fit_ema`, hard Top-1 argmax | `compute_spectral_routing()` |
-| Oracle routing (training) | `weights[:, 0] = 1.0` — current task always selected | `compute_spectral_routing()` |
-| ~~A-row proxy $\alpha_t^{\mathrm{train}}$ + $\beta(n)$~~ | ~~A-row fit + adaptive bias~~ | **❌ Removed (V9 oracle training)** |
-| ~~Symmetric inference SVD~~ | ~~`prepare_inference_routing()` → SVD of $B_t A_t$~~ | **❌ Fully removed (V8)** |
-| Calibration normalization $\hat{\mu}_t$ | EMA fit scale collected during training, stored in signatures | `compute_spectral_routing()` |
-| Drift-free input $h$ | `inputs_embeds = embed_tokens(input_ids)` → mean-pool | `T5Stack.forward()` |
-| GPM + InfLoRA null-space | `get_reg_matrix()` | `cl_trainer_specroute.py` |
-| Dynamic ESA threshold | `(1−ε₀)·t/T + ε₀` | `cl_trainer_specroute.py` |
-| C4: Preconditioner | `precompute_preconditioners()` → eigendecomposition | `cl_trainer_specroute.py` |
-| **C5: Data-informed init** | **`pre_task_data_collection()` → `eigh(Q@C@Q)` → set `lora_A.data`** | **`cl_trainer_specroute.py`** |
-| C5: Fallback | max eigval < 1e-6 → skip C5, keep Kaiming + InfLoRA projection | `cl_trainer_specroute.py` |
-| **V10a: Learned Routing** | **`Trans_input` + `prompt_key` gating with exact post-step GPM constraints** | **`t5_specroute.py` & `cl_trainer_specroute.py`** |
-| **V10b: Grassmann Routing** | **Geometry-based routing via Grassmannian distance on batch principal subspaces** | **`t5_specroute.py`** |
----
-## 9. Experimental Setup
-| Item | Value |
-|----------|---------|
-| Model | `google/flan-t5-small` (60M) / `flan-t5-large` (783M) |
-| Benchmarks | SuperNI (15 tasks, 2 orderings), Long (15 tasks, 2 orderings) |
-| Metrics | AP (Average Performance, ↑), FT (Forgetting, ↓) |
-| LoRA | $r = 8$, target=Q+V, InfLoRA (only B trained, A frozen) |
-| Routing | Oracle training (current task weight=1.0); Hard Top-1 calibrated A-row (inference) |
-| ESA | $\varepsilon_0 = 0.995$ (dynamic) |
-| C4 | Gradient preconditioning enabled (`--use_preconditioning True`), $\epsilon = 10^{-6}$; entropy reg removed in V7 |
-| **C5** | **N_batch = 100, `torch.linalg.eigh` on the projected covariance, fallback if max_eigval < 1e-6** |
-| GPM repr. | 200 batches (reduced from 1000; the SVD is stable after 200) |
-| **Scalability note** | **C5 per-layer eigendecomposition: a dense `eigh` costs $O(d^3)$ per layer per task. For T5-small ($d=512$) across all layers: acceptable. For flan-t5-large ($d=1024$): about 8× more expensive, but still only linear in the number of tasks. For LLaMA-7B ($d=4096$): use randomized SVD or Lanczos (the top-$r$ eigenvectors do not need a full decomposition), reducing the cost to roughly $O(d^2 r)$ per layer.** |
-| Precision | fp32 + gradient_checkpointing (T5 + P100: fp16 risks NaN overflow with large softmax) |
-| P100 BSZ | BSZ=8, GA=4 (effective 32); T4: BSZ=2, GA=8 |
-| Runtime (P100 16GB) | SuperNI T5-Small ≈ 2-3h; Long benchmark ≈ 3-4h, comfortably within the 12h Kaggle limit |
-| Comparison | Batch size, LR, scheduler exactly match ROOT (GainLoRA) |
----
-## 10. File Map
-| File | Role |
-|------|---------|
-| `src/t5_specroute.py` | T5Stack + spectral routing + thin SVD |
-| `src/t5_gainlora_inflora.py` | LoRALayer, T5Attention, T5Block (shared base) |
-| `src/cl_trainer_specroute.py` | Trainer: GPM, InfLoRA, ESA, C4, C5, training_step |
-| `src/run_t5.py` | Entry: model loading, parameter freezing |
-| `gen_script_*_specroute*.sh` | Experiment scripts |

+# SpecRoute: Spectral Routing with Discriminative Initialization and Overlap-Aware Projection for Continual LoRA Learning
+> **Official design document.** Constraint: cheat ≤ root (GainLoRA). Theory-first.
 ---
+# PART I: PROBLEM AND MOTIVATION
 ---
+## 1. Problem and Constraints
+### 1.1 Problem
+Continual learning with large language models (LLMs) requires the model to acquire $T$ heterogeneous tasks sequentially (sentiment classification, language inference, question answering) while data from earlier tasks is unavailable when a new task is trained. For each task $t$, a LoRA adapter $\Delta W_t = B_t A_t$ is added to the frozen base model; $A_t$ is frozen after initialization (the InfLoRA constraint), and only $B_t$ is trained. At inference, task identity is not provided:
+$$y = f\!\Bigl(W_0\, x \;+\; \sum_{t=1}^{T} w_t(x)\; B_t A_t\, x\Bigr)$$
+Three intertwined challenges:
+| Challenge | Requirement |
+|:----------:|---------|
+| **Routing (R)** | Identify the right adapter for each input |
+| **Protection (P)** | Ensure old adapters do not degrade when a new task is learned |
+| **Allocation (A)** | Manage the finite subspace capacity across $T$ adapters |
+### 1.2 Constraint: Cheat ≤ Root
+GainLoRA (the baseline) stores the following statistics from old tasks:
+- SVD bases of the activation covariance (GPM): `reg_{i}.pt` per layer
+- SVD bases of the MLP activations at 3 layers: `trans_input/reg_{0,1,2}.pt`
+- Frozen MLP weights (`previous_trans_input.pt`) + frozen prompt keys
+- **Mean routing distribution** from an eval inference pass on old data (`attention_weights.pkl`) → used as the KL distillation target
+**Principle**: SpecRoute may store second-moment statistics from old tasks (the same kind as GPM bases), as long as the total "cheat budget" does not exceed root's. Specifically, SpecRoute does **not** need an eval inference pass on old data, frozen MLP copies, or KL distillation targets. It is more frugal than root on these three items and, in exchange, stores one extra projected covariance $\tilde{C}_s$ per task.
+---
+## 2. Baseline and Its Problems
+### 2.1 GainLoRA
+GainLoRA (NeurIPS 2025) addresses the three challenges with three separate mechanisms:
+**Routing** via two learned components:
+- **`trans_input`**: a two-layer MLP ($d_{\text{model}} \to d_{\text{hidden}} \to d_{\text{model}}$, SiLU activation) that maps the mean embedding of the sequence to a query vector $q \in \mathbb{R}^{d}$.
+- **`prompt_key`**: a per-task parameter vector $k_t \in \mathbb{R}^{d}$ acting as the routing key. At inference, the cosine similarity between $q$ and every key $\{k_1, \ldots, k_T\}$ is computed; the result passes through a sigmoid to produce a gating weight for each adapter.
+**Protection** via **GPM (Gradient Projection Memory)**: after each task, collect the activation covariance, compute its SVD, and store the basis $U_t \in \mathbb{R}^{d \times r_t}$. Gradients on `lora_A` and `trans_input` are projected onto the null-space: $\Delta W \leftarrow \Delta W - UU^\top \Delta W$.
+**Allocation** via a gradually increasing dynamic threshold $\varepsilon_t$ that controls how many singular vectors are kept in the GPM basis.
+### 2.2 Structural Contradiction
+Routing depends on learned parameters, which creates a loop:
+$$\texttt{trans\_input}\ \text{drift} \;\to\; \text{prompt\_key misalign} \;\to\; \text{GPM protects routing} \;\to\; \text{consumes subspace} \;\to\; \cdots$$
+To stabilize this, GainLoRA needs KL distillation: it stores the routing distributions $\{p_s\}$ from old eval data and minimizes $D_{\text{KL}}(p_s^{\text{stored}} \| p_s^{\text{current}})$, which requires an eval inference pass on old data plus stored routing statistics.
+### 2.3 Two Core Problems to Solve
+**Problem 1: Same-domain Learning Collapse (Routing)**: When tasks share a domain (yelp/amazon/imdb, TF-IDF cosine ≈ 0.89), C5 init ($A_t = \text{top eigvecs}(\tilde{C}_t)$) yields nearly coincident eigenvectors → routing margin ≈ 0 → expert output ≈ 0 → learning collapse. Evidence: v10a scores qqp=11.95 and rte=10.11, versus 76.96 and 45.85 for root.
+**Problem 2: Shared Subspace Exclusion (Learning)**: InfLoRA strictly forces $A_t \in \text{null}(P_{\text{old}})$. When tasks share knowledge (same domain), the **optimal learning directions lie inside the old subspace**, exactly where InfLoRA forbids $A_t$ to go. As a result the model is forced to learn from noise directions and loses forward transfer.
+> **Key point**: Problem 1 concerns *discriminating between tasks* (routing); Problem 2 concerns *sharing knowledge between tasks* (learning). The two problems call for two complementary solutions.
 ---
+## 3. Ideas
+### 3.1 Routing–Protection Duality
+GPM ensures that adapters occupy nearly orthogonal subspaces of the input space. This orthogonality also yields a natural routing signal: an input $h_t$ characteristic of task $t$ has high alignment with $\text{span}(V_t)$ and near-zero alignment with $\text{span}(V_s)$ ($s \neq t$). Since alignment = routing, no learned parameters are needed.
+> **Duality**: Preventing catastrophic forgetting and identifying the task both arise from the same orthogonal subspace structure.
+### 3.2 GPM–Routing Paradox
+InfLoRA initializes $A_t$ randomly before projecting it into the null-space. A random matrix carries no task information → affinity scores are roughly equal → near-random routing. GPM guarantees orthogonality (the necessary condition), but random initialization wipes out the routing signal (the sufficient condition is violated).
+### 3.3 Solution to Problem 1: Contrastive Projected Initialization (CPI)
+**Old C5 version** (v2–v10a, failed): $A_t = \text{top-}r\text{ eigvecs}(\tilde{C}_t)$ with $\tilde{C}_t = Q_{t-1}C_tQ_{t-1}$.
+**Problem**: For same-domain tasks, $\tilde{C}_{\text{amazon}} \approx \tilde{C}_{\text{yelp}}$ → the top eigenvectors nearly coincide → routing margin $\approx 0$ → **learning collapse**.
+**CPI changes the objective**: instead of the directions of largest variance, it finds the **most discriminative directions** relative to old tasks:
+$$\boxed{A_t^{\text{CPI}} = \arg\max_{A \in \mathcal{A}_t} \; \text{tr}\!\left(A\,\bigl(\tilde{C}_t - \gamma\,\bar{C}_{<t}\bigr)\,A^\top\right)}$$
+**Original (unweighted) version**: $\bar{C}_{<t} = \frac{1}{t-1}\sum_{s<t} \tilde{C}_s$, the uniform mean over all old tasks.
+**Upgraded version (Weighted CPI)**: weights old tasks by *domain proximity*; the more an old task resembles the current one, the more strongly it is subtracted:
+$$\boxed{\bar{C}_{<t}^{\,\mathrm{w}} = \frac{\sum_{s<t} \rho_{s,t}\,\tilde{C}_s}{\sum_{s<t} \rho_{s,t} + \varepsilon}, \qquad \rho_{s,t} = \frac{\text{tr}(\tilde{C}_s \cdot C_t)}{\text{tr}(\tilde{C}_s)\,\text{tr}(C_t)}}$$
+The weight $\rho_{s,t}$ measures the alignment between the second moments of old task $s$ and new task $t$. When task $s$ is cross-domain to task $t$: $\rho_{s,t} \approx 0$ → small contribution (GPM ensures cross-domain covariances are nearly orthogonal). When $s$ is same-domain to $t$: $\rho_{s,t}$ is large → it is subtracted strongly, matching the discriminative goal. This is tighter than the unweighted mean because cross-domain tasks do not "dilute" the contrastive signal for same-domain pairs.
+*Note*: Thanks to GPM's subspace orthogonality, cross-domain tasks naturally have $\text{tr}(\tilde{C}_s \cdot C_t) \approx 0$ → the flat unweighted mean also has little bias. Weighted CPI is nevertheless more principled, and it is especially helpful when there are many cross-domain tasks but only a few same-domain ones.
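The weighted mean can be sketched in a few lines of NumPy. This is a minimal illustration of the formula for $\bar{C}_{<t}^{\,\mathrm{w}}$ and $\rho_{s,t}$ only, not the repository's code; the function name `weighted_old_covariance` and its arguments are hypothetical.

```python
import numpy as np

def weighted_old_covariance(old_covs, c_t, eps=1e-8):
    """Proximity-weighted mean of stored projected covariances (hypothetical helper).

    old_covs: list of (d, d) projected covariances C~_s of old tasks.
    c_t:      (d, d) covariance of the current task.
    Returns the weighted mean and the weights rho_{s,t}.
    """
    rhos = np.array([
        np.trace(c_s @ c_t) / (np.trace(c_s) * np.trace(c_t))
        for c_s in old_covs
    ])
    weighted_sum = sum(rho * c_s for rho, c_s in zip(rhos, old_covs))
    return weighted_sum / (rhos.sum() + eps), rhos
```

With one old task that shares a direction with the current task and one that is orthogonal to it, the orthogonal task receives weight 0 and drops out of the mean.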
+**Solution**: the top-$r$ eigenvectors of the discriminant matrix $D_t = \tilde{C}_t - \gamma\bar{C}_{<t}^{\,\mathrm{w}}$.
+**Why CPI fixes learning collapse**: $D_{\text{amazon}} = \tilde{C}_{\text{amazon}} - \gamma\tilde{C}_{\text{yelp}}$ subtracts the shared variance → the remaining eigenvectors are discriminative → routing signal > 0.
+**Guidance for choosing $\gamma$**: When $\gamma$ is too high ($\to 1$), most eigenvalues become negative → Kaiming fallback. In practice, $\gamma$ should be chosen so that enough eigenvalues remain positive (at least $r$). Heuristic: $\gamma^* \approx 1 - \frac{r}{\text{rank}(\tilde{C}_t)}$. For flan-t5-small ($d=512$, $r=8$), $\gamma \in [0.3, 0.7]$ is the stable range. In addition, the code includes an **adaptive fallback**: if fewer than $r$ eigenvalues are positive, the missing rows are filled with random vectors in the null-space (Kaiming scale), so initialization never fails outright.
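A minimal NumPy sketch of the CPI initialization with the adaptive fallback follows. It is an illustration under stated assumptions: `cpi_init` is a hypothetical name, and the fallback here fills missing rows with Gaussian vectors and re-orthonormalizes them by QR, which simplifies the null-space, Kaiming-scale fill described above.

```python
import numpy as np

def cpi_init(c_tilde, c_bar, r, gamma=0.5, seed=0):
    """Top-r eigenvectors of D_t = C~_t - gamma * C_bar, as rows of A_t.

    Simplified fallback (an assumption of this sketch): if fewer than r
    eigenvalues are positive, the remaining rows are random Gaussian
    vectors, orthonormalized against the kept eigenvectors via QR.
    """
    d_t = c_tilde - gamma * c_bar
    d_t = 0.5 * (d_t + d_t.T)                 # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(d_t)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # indices, largest eigenvalue first
    keep = [i for i in order[:r] if eigvals[i] > 0]
    rows = eigvecs[:, keep].T                 # shape (k, d), k <= r
    if rows.shape[0] < r:
        rng = np.random.default_rng(seed)
        extra = rng.standard_normal((r - rows.shape[0], d_t.shape[0]))
        q, _ = np.linalg.qr(np.vstack([rows, extra]).T)  # (d, r), orthonormal columns
        rows = q.T
    return rows
```

Because `eigh` returns eigenvectors only up to sign, any check on the result should compare absolute values or the spanned subspace rather than raw entries.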
+### 3.4 Solution to Problem 2: Overlap-Aware Projection (OAP)
+**Core problem**: InfLoRA uses the hard null-space projection $A_t \leftarrow A_t(I - P_{\text{old}})$, which removes **every** component of $A_t$ in the old subspace. When tasks share an optimal subspace, this destroys exactly the most useful learning directions.
+**Quantification: Shared Subspace Exclusion (SSE):**
+$$\text{SSE}_t = \frac{\text{tr}(P_{\text{old}} \cdot C_t)}{\text{tr}(C_t)} \in [0,1]$$
+$\text{SSE}_t$ measures the fraction of task $t$'s variance that lies in the old subspace and is therefore discarded by InfLoRA. For same-domain tasks (EDA: TF-IDF similarity yelp↔amazon = 0.898), SSE can reach 0.7–0.9, meaning **70–90% of the useful learning signal is thrown away**.
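As a rough illustration, the SSE ratio can be computed directly from an orthonormal basis of the old subspace. The helper name `shared_subspace_exclusion` and the assumption that `u_old` has orthonormal columns (so that $P_{\text{old}} = U U^\top$) are choices of this sketch.

```python
import numpy as np

def shared_subspace_exclusion(u_old, c_t):
    """SSE_t = tr(P_old C_t) / tr(C_t), with P_old = U U^T.

    u_old: (d, k) matrix with orthonormal columns spanning the old subspace.
    c_t:   (d, d) activation covariance of the current task.
    """
    p_old = u_old @ u_old.T
    return float(np.trace(p_old @ c_t) / np.trace(c_t))
```

For example, if the old subspace covers two coordinates that carry 8 of the 10 units of variance in `c_t`, the ratio is 0.8.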
+**Theoretical basis for relaxation:**
+1. **TRGP (Lin et al., ICLR 2022)** shows that strict null-space projection hinders forward transfer when tasks are strongly correlated, and proposes a "trust region" that allows knowledge from related old tasks to be reused through scaled weight projection.
+2. **Shared-Private Subspace Decomposition** (classic multi-task learning): Argyriou et al. (2008) show that decomposing the representation into shared + private components optimizes transfer in the multi-task setting.
+3. **Principal angles on the Grassmannian**: the geodesic distance between subspaces $\mathcal{V}_t$ and $\mathcal{V}_s$ on the Grassmann manifold $\text{Gr}(r,d)$ determines the degree of overlap. InfLoRA forces $d_G = \pi r/2$ (maximal distance), which is too strong for same-domain tasks.
+4. **Information Bottleneck perspective**: InfLoRA drives $I(\hat{X}_t; X_s) = 0$ (complete elimination). The real optimum should instead maximize $I(\hat{X}_t; Y_t)$ subject to $I(\hat{X}_t; X_s | Y_s) \leq \eta$, which allows sharing as long as it does not harm old-task performance.
+**Key insight**: SpecRoute uses **hard Top-1 routing** at inference. When routing is correct ($w_{t^*} = 1$, $w_{t \neq t^*} = 0$), only ONE adapter fires per input → subspace overlap does NOT cause forgetting. Forgetting occurs only when routing is wrong. Hence the amount of forgetting is gated by the routing error probability $p_e$, NOT by subspace overlap.
+**Comparison with TRGP (§3.4.1)**:
+| Aspect | TRGP (Lin et al., 2022) | OAP (SpecRoute) |
+|---------|--------------------------|-----------------|
+| **How the relaxation level is set** | Task-similarity heuristic: selects "related" old tasks by cosine similarity and uses the projected gradient on the subspaces of the selected old tasks | Automatic per layer: $\beta_l = \max(\beta_{\min}, 1 - \eta \cdot \rho_l)$, with $\rho_l$ measuring the overlap ratio directly from the covariance |
+| **Granularity** | Task-level: one relaxation level for the whole model | Layer-level: each layer has its own $\beta_l$ based on local overlap |
+| **Architectural context** | Gradient projection for all parameters (full model) | Integrated with LoRA ($A_t$ only) + hard Top-1 routing → forgetting gated by $p_e$ |
+| **Coupling with routing** | No routing mechanism of its own | Combined with CPI: high routing accuracy → low $p_e$ → safe to relax $\beta_l$ |
+| **Basis for the decision** | Similarity between task representations | Overlap ratio $\rho_l$ computed directly from spectral analysis (Theorems 4, 5) |
+The core novelty: OAP is not merely "relaxing the null-space" (TRGP already does that) but **relaxing the null-space under a safety condition provided by hard routing**. Forgetting is gated by $p_e \times (1-\beta_l)$ (Theorem 4), and CPI keeps $p_e$ low (Theorem 3). Combining the three components (CPI + OAP + hard routing) yields a **system-level advantage** that TRGP lacks: in the cross-domain regime the three components reinforce one another; in the same-domain regime, $\beta_{\min}$ keeps the worst case bounded.
+**OAP formulation**: Two integrated steps (different purposes, same $\beta_l$):
+**Step 1: OAP on the covariance** (for CPI init): instead of $\tilde{C}_t = Q_{t-1}C_tQ_{t-1}$ (InfLoRA, full projection onto the null-space), use the relaxed form:
+$$\tilde{C}_t^{\text{OAP}} = (I - \beta_l P_{\text{old}})\,C_t\,(I - \beta_l P_{\text{old}})$$
+The eigenvectors of $D_t = \tilde{C}_t^{\text{OAP}} - \gamma\bar{C}^{\mathrm{w}}_{<t}$ find discriminative directions *within* the OAP subspace.
+**Step 2: OAP projection on $A_t$** (enforce the constraint): after init, apply the same relaxed projection:
+$$\boxed{A_t \leftarrow A_t(I - \beta_l \cdot P_{\text{old}})}$$
+The two steps are consistent with each other: Step 1 steers the init toward the OAP subspace (discriminative), and Step 2 enforces that constraint. Combined effect: the $P_{\text{old}}$ component of $A_t$ is scaled by roughly $(1-\beta_l)^2$, i.e. it is **more conservative** than a single projection. This is a deliberate design choice for extra safety.
+In both steps, $\beta_l$ is computed from the per-layer overlap ratio:
+$$\rho_l = \frac{\text{tr}(P_{\text{old}}^{(l)} \cdot C_t^{(l)})}{\text{tr}(C_t^{(l)})}$$
+$$\beta_l = \max(\beta_{\min},\; 1 - \eta \cdot \rho_l)$$
+- $\eta = 0$: original InfLoRA (strict null-space) → no forward transfer
+- $\eta = 1, \rho_l = 1$: $\beta_l = \beta_{\min}$ → maximum sharing
+- Cross-domain ($\rho_l \approx 0$): $\beta_l \approx 1$ → near-strict null-space (auto-adaptive)
+- Same-domain ($\rho_l \approx 0.8$): $\beta_l \approx 1 - 0.8\eta$ → substantial relaxation, keeps the shared directions
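The rule for $\beta_l$ and the relaxed projection of Step 2 can be sketched as follows. The function names `oap_beta` and `oap_project` are hypothetical, and `u_old` is assumed to have orthonormal columns so that $P_{\text{old}} = U U^\top$.

```python
import numpy as np

def oap_beta(rho, eta, beta_min):
    """beta_l = max(beta_min, 1 - eta * rho_l)."""
    return max(beta_min, 1.0 - eta * rho)

def oap_project(a, u_old, beta):
    """Relaxed projection A <- A (I - beta * P_old), with P_old = U U^T."""
    return a - beta * (a @ u_old) @ u_old.T
```

Setting `beta=1.0` recovers the strict InfLoRA projection, while a smaller `beta` leaves a fraction `1 - beta` of each row's old-subspace component in place.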
+**Safeguards when routing is not yet good (§3.4.2)**:
+A reasonable objection: in early tasks, or when CPI is not yet strong enough, routing accuracy is low → relaxing $\beta_l$ can increase forgetting. The safeguards are:
+1. **$\beta_{\min}$**: always guarantees a minimum level of protection, even when $\rho_l$ is very high.
+2. **Warmup by task index**: $\eta_{\text{eff}}(t) = \eta \cdot \min(1, (t-1)/T_{\text{warmup}})$. At task 2 (little CPI data yet), $\eta_{\text{eff}} = \eta / T_{\text{warmup}}$ is only a fraction of $\eta$ → close to original InfLoRA. As $t$ grows, CPI accumulates more $\tilde{C}_s$ → better routing → it becomes safe to raise $\eta_{\text{eff}}$.
+3. **Higher $\beta_{\min}$ for early tasks**: when $t \leq T_{\text{warmup}}$ (default 3), use $\beta_{\min} = 0.7$ (conservative); afterwards lower it to $\beta_{\min} = 0.3$.
+4. **Auto-detection**: $\rho_l \approx 0$ for cross-domain tasks → $\beta_l \approx 1$ automatically → no concern that OAP hurts when tasks come from different domains.
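The warmup schedule and the two-level $\beta_{\min}$ above amount to two small functions. This is a sketch with hypothetical names (`eta_eff`, `beta_min_schedule`); the defaults mirror the values quoted in the list.

```python
def eta_eff(t, eta, t_warmup=3):
    """Linear warmup by task index: eta * min(1, (t - 1) / T_warmup)."""
    return eta * min(1.0, (t - 1) / t_warmup)

def beta_min_schedule(t, t_warmup=3, early=0.7, late=0.3):
    """Conservative floor while t <= T_warmup, relaxed floor afterwards."""
    return early if t <= t_warmup else late
```

For `eta = 0.6` and `t_warmup = 3`, the effective value is 0 at task 1, one third of `eta` at task 2, and reaches the full `eta` from task 4 onward.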
+**SSE reduction:** OAP *retains* part of the variance in the old subspace instead of discarding it entirely (InfLoRA). Writing $\text{SSE}_t^{\text{OAP}}$ for the retained share:
+$$\text{SSE}_t^{\text{OAP}} \approx (1-\beta_l)^2 \cdot \text{SSE}_t \;\in\; [0,\,\text{SSE}_t]$$
+When $\beta_l = 1$ (strict InfLoRA): $(1-1)^2 = 0$ → fully discarded. When $\beta_l = \beta_{\min}$ (maximum OAP): $(1-\beta_{\min})^2 \cdot \text{SSE}_t > 0$ → part is retained for learning.
+### 3.5 Overall Interaction: CPI + OAP
+**CPI** solves Problem 1: it finds *discriminative* directions → strong routing.
+**OAP** solves Problem 2: it relaxes the null-space → *knowledge sharing* → strong learning.
+Combined: CPI operates on the **relaxed projected covariance**:
+$$D_t = Q_{\text{OAP}} \cdot C_t \cdot Q_{\text{OAP}} - \gamma\,\bar{C}_{<t}^{\,\mathrm{w}}$$
+where $Q_{\text{OAP}} = I - \beta_l P_{\text{old}}$.
+**Interaction: A Controlled Trade-off with a Bounded Worst Case**
+CPI and OAP are not an "unconditional mutual-reinforcement loop"; this is a *controlled trade-off* with explicitly designed safety conditions:
+| Regime | CPI | OAP | Result |
+|--------|-----|-----|---------|
+| **t = 1** (first task) | No old covs → C5 init (γ ignored) | $\eta_{\text{eff}} = 0$ → strict InfLoRA, OAP deactivated | Behaves like the baseline; no risk |
+| **Cross-domain** (easy) | Good discriminative init, high routing margin | $\rho_l \approx 0$ → $\beta_l \approx 1$ (strict, automatic) | Good routing, low forgetting |
+| **Same-domain** (hard) | Lower margin (structural); CPI improves markedly over C5 | $\rho_l$ high → $\beta_l$ drops, but is floored at $\beta_{\min}$ | Forward transfer increases; forgetting is controlled by $p_e \cdot (1-\beta_{\min})$ |
+**Key safety conditions**: In the same-domain regime (the hardest), forgetting does not grow without bound because:
+1. **$\beta_{\min}$ is a provable mathematical bound** (Theorem 4 and §4.7): forgetting $\leq p_e \cdot (1-\beta_{\min}) \cdot M$ regardless of how large $\rho_l$ is.
+2. **Comparing against the right baseline**: The reference point is not "no OAP" but original InfLoRA (v10a), which is already completely broken (SSE 70-90%, qqp=11.95, rte=10.11). OAP does not need to reach a theoretical optimum; it only needs to beat this failed baseline.
+3. **Warmup ($\eta_{\text{eff}}(t)$) is an empirical safeguard** (not a theoretical bound): it helps avoid practical risk in early tasks, when there is not yet enough CPI history.
+*Note*: The claim "AP gain > forgetting cost" is an empirical observation on the Long Order3/SuperNI benchmarks, not a provable theoretical bound. A theoretical sufficient condition is established only for forgetting (Theorem 4), not for the absolute AP gain.
 ---
+# PART II: THEORY
 ---
+## 4. Theory and Proofs
+### 4.1 Spectral Signature and Affinity
+**Definition 1** *(Spectral Signature).* For a frozen expert $\Delta W_t = B_t A_t$ with thin SVD $\Delta W_t = U_t \Sigma_t V_t^\top$, the spectral signature is $\mathcal{S}_t = (V_t, \boldsymbol{\sigma}_t)$:
+- $V_t \in \mathbb{R}^{d \times r}$: input receptive field.
+- $\boldsymbol{\sigma}_t$: sensitivity spectrum.
+**Definition 2** *(Spectral Affinity).*
+$$\alpha_t(h) = \frac{\|\Delta W_t h\|^2}{\|\Delta W_t\|_F^2 \|h\|^2} = \frac{\sum_{i=1}^{r}\sigma_{t,i}^2(v_{t,i}^\top h)^2}{(\sum_i \sigma_{t,i}^2)\|h\|^2} \in [0,1]$$
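Definition 2 can be evaluated without an explicit SVD, since the first form only needs $\|\Delta W_t h\|$ and $\|\Delta W_t\|_F$. The sketch below assumes dense NumPy arrays; `spectral_affinity` is a hypothetical name.

```python
import numpy as np

def spectral_affinity(b, a, h):
    """alpha_t(h) = ||dW h||^2 / (||dW||_F^2 * ||h||^2), with dW = B A.

    b: (d_out, r), a: (r, d), h: (d,). The result lies in [0, 1].
    """
    dw = b @ a
    num = np.linalg.norm(dw @ h) ** 2
    den = np.linalg.norm(dw, "fro") ** 2 * np.linalg.norm(h) ** 2
    return float(num / den)
```

An input aligned with the expert's dominant right singular direction scores close to $\sigma_{\max}^2 / \sum_i \sigma_i^2$, and an input orthogonal to the row space of $A_t$ scores 0.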
+### 4.2 Theorem 1: Routing–Protection Duality
+**Definition 3** *(Subspace Overlap).* $\delta_{ij} = \|V_i^\top V_j\|_F^2$.
+**Theorem 1.** If GPM guarantees $\delta_{ij} \leq \varepsilon$ $\forall i \neq j$, then for a unit input $h \in \text{span}(V_{t^*})$:
+$$\boxed{\alpha_{t^*}(h) - \max_{t \neq t^*}\alpha_t(h) \geq \kappa_{\min}(t^*) - \varepsilon\kappa_{\max}}$$
+Here $\kappa_{\min}(t) = \sigma_{t,\min}^2/\sum_i\sigma_{t,i}^2$, and $\kappa_{\max}$ is the analogous ratio with $\sigma_{t,\max}^2$, maximized over $t \neq t^*$.
+**Proof.** $h = V_{t^*}c$, $\|c\|=1$ → $\alpha_{t^*}(h) \geq \kappa_{\min}(t^*)$. For $t \neq t^*$: $\|V_t^\top h\|^2 \leq \delta_{t,t^*} \leq \varepsilon$ → $\alpha_t(h) \leq \kappa_{\max}\varepsilon$. $\square$
+**Corollary 1** *(Confidence).* $w_{t^*}(h) \geq 1/(1+(T-1)e^{-m/\tau})$, $m = \kappa_{\min}(t^*)-\varepsilon\kappa_{\max}$.
+**Corollary 2** *(Capacity Bound).* $T_{\max} \leq d/(\bar{k}(1-\varepsilon))$.
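A small numeric sketch of Corollary 1: given a margin $m$, the lower bound on $w_{t^*}$ and the largest temperature that still achieves a target confidence. The temperature formula $\tau \leq m/\ln((T-1)/\delta)$ is the one quoted in the earlier version of this document; the function names are hypothetical.

```python
import math

def routing_confidence(margin, tau, num_tasks):
    """Lower bound on w_{t*}: 1 / (1 + (T - 1) * exp(-m / tau))."""
    return 1.0 / (1.0 + (num_tasks - 1) * math.exp(-margin / tau))

def max_temperature(margin, delta, num_tasks):
    """Largest tau with w_{t*} >= 1 - delta; requires (T - 1) / delta > 1."""
    return margin / math.log((num_tasks - 1) / delta)
```

At the boundary temperature the bound evaluates to $1/(1+\delta)$, which is at least $1-\delta$, so the two statements agree.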
+### 4.3 Proposition 1: InfLoRA Orthogonality
+InfLoRA projects $A_t$ into the null-space: $A_t \leftarrow A_t(I-P_{\text{old}})$ → $\text{rowspace}(A_t) \subseteq \text{null}(P_{\text{old}})$. Since $\text{rowspace}(B_tA_t) \subseteq \text{rowspace}(A_t)$:
+$$\text{span}(V_t) \subseteq \text{null}(P_{\text{old}}) \approx \perp\,\text{span}(V_s) \;\forall s < t \;\square$$
+### 4.4 Lemma 1: Differential Projection
+For $A_tP_{t-1} = 0$ (InfLoRA), $\forall h$:
+$$\|A_t h\|^2 = \|A_t Q_{t-1}h\|^2, \quad Q_{t-1} = I - P_{t-1}$$
+**Corollary A**: $E_{h \sim p_s}[\alpha_t(h)] \leq 0.005\,\text{tr}(C_s)/r$ (old data is rejected).
+**Corollary B**: $\alpha_s(h_t)$ depends only on $P_{t-1}h_t$.
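Lemma 1 is easy to check numerically: build a random projector, project a random $A$ into its null-space, and compare both sides. The helper `check_lemma1` and its default sizes are illustrative assumptions.

```python
import numpy as np

def check_lemma1(d=8, k=3, r=2, seed=0):
    """Return (||A h||^2, ||A Q h||^2, max |A P|) for a random instance."""
    rng = np.random.default_rng(seed)
    u_old, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal basis of the old subspace
    p_old = u_old @ u_old.T
    q = np.eye(d) - p_old
    a = rng.standard_normal((r, d)) @ q                   # strict projection, so A P = 0
    h = rng.standard_normal(d)
    lhs = np.linalg.norm(a @ h) ** 2
    rhs = np.linalg.norm(a @ q @ h) ** 2
    return lhs, rhs, float(np.abs(a @ p_old).max())
```

The identity holds because $A h = A(P + Q)h = A Q h$ whenever $A P = 0$, so the two norms agree up to floating-point error.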
+### 4.5 Theorem 2: CPI Optimality
+**Definition 4** *(Restricted Stiefel Manifold).* $\mathcal{A}_t = \{A \in \mathbb{R}^{r \times d}: AP_{t-1}=0, AA^\top=I_r\}$.
+**Theorem 2** *(CPI is the Optimal Discriminative Init).* Let $D_t = \tilde{C}_t - \gamma\bar{C}_{<t}^{\,\mathrm{w}}$ (or the unweighted $\bar{C}_{<t}$; the proof is equivalent because it depends only on the structure of $D_t$, not on how $\bar{C}$ is computed):
+$$\arg\max_{A_t \in \mathcal{A}_t}\left[E_{h \sim p_t}[\alpha_t(h)] - \gamma\cdot\frac{1}{t-1}\sum_{s<t}E_{h \sim p_s}[\alpha_t(h)]\right] = \text{top-}r\text{ eigvecs of } D_t$$
+**Proof.** From Lemma 1: $E_{p_t}[\alpha_t(h)] = \text{tr}(A\tilde{C}_tA^\top)/r$. The objective equals $\frac{1}{r}\text{tr}(AD_tA^\top)$. With $AA^\top = I_r$, this is Constrained PCA on $D_t$. $\square$
+**Connection to the Fisher discriminant:** When $\gamma = 1$, $D_t$ resembles the between-class scatter in LDA, but uses only second moments.
+### 4.6 Theorem 3: CPI Routing Margin
+Let $\lambda_{\min}^+(D_t) = \min\{\lambda_i(D_t): \lambda_i > 0, i \leq r\}$. With CPI init:
+$$\boxed{E_{p_t}[\alpha_t(h)] - \max_{s<t}E_{p_s}[\alpha_t(h)] \geq \frac{\lambda_{\min}^+(D_t)}{r}}$$
+### 4.7 Theorem 4: OAP Forgetting Bound
+**Theorem 4** *(Routing-Gated Forgetting).* Under hard Top-1 routing (SpecRoute inference), a relaxed projection $\beta_l < 1$, and routing error probability $p_e(s) = P(\text{wrong adapter selected} \mid h \sim p_s)$:
+$$\boxed{\text{FT}(s) \leq p_e(s) \cdot (1-\beta_l) \cdot \frac{\|B_t\|_F \cdot \sqrt{\text{tr}(C_s)}}{\text{output scale}}}$$
+**Detailed proof.**
+*Setup.* Consider an input $h_s \sim p_s$ (old task $s$). At inference, SpecRoute uses hard Top-1:
+$$w_k(h_s) = \begin{cases} 1 & \text{if } k = \arg\max_j \alpha_j^{\text{cal}}(h_s) \\ 0 & \text{otherwise} \end{cases}$$
+*Step 1: Case split.*
+- **Correct routing** ($w_s = 1$): Output = $W_0 h_s + B_s A_s h_s$. Adapter $t$ ($t \neq s$) **does not contribute** → forgetting = 0, regardless of whether $A_t$ overlaps the old subspace.
+- **Wrong routing** ($w_t = 1$, $t \neq s$): Output = $W_0 h_s + B_t A_t h_s$. Deviation from the correct output: $\Delta y = B_t A_t h_s - B_s A_s h_s$.
+*Step 2: Bound the deviation under wrong routing.*
+$$\|\Delta y\| \leq \|B_t A_t h_s\| + \|B_s A_s h_s\| \leq \|B_t\|_F \|A_t h_s\| + \|B_s\|_F \|A_s h_s\|$$
+With OAP, the component of $A_t$ in the old subspace is scaled by $(1-\beta_l)$:
+$$\|A_t h_s\|^2 \leq (1-\beta_l)^2 \|P_{\text{old}} h_s\|^2 + \|Q h_s\|^2$$
+Taking expectations: $E[\|A_t h_s\|^2] \leq (1-\beta_l)^2 \text{tr}(P_{\text{old}} C_s)/r + \text{tr}(Q C_s Q)/r$. The second term is small (old data lies mostly in the old subspace). The first term is scaled by $(1-\beta_l)^2$, i.e. by $(1-\beta_l)$ after taking the square root.
+*Step 3: Putting it together.*
+$$\text{FT}(s) = E_{h_s}[\text{misrouting loss}] = P(\text{misrouted}) \cdot E[\text{loss} \mid \text{misrouted}]$$
+$$\leq p_e(s) \cdot \|B_t\|_F \cdot (1-\beta_l) \cdot \sqrt{\text{tr}(P_{\text{old}} C_s)} / \text{output scale}$$
+*Assumptions:* $(i)$ Hard Top-1 routing (no soft mixing). $(ii)$ $p_e(s)$ and $\|B_t\|_F$ are approximately independent, which is reasonable because $p_e$ depends on the initialization of $A_t$ (CPI), whereas $\|B_t\|_F$ depends on the training of $B_t$. $(iii)$ The null-space component is neglected (small for old data). $\square$
+**Discussion of assumptions**: Assumption $(ii)$, the independence of $p_e$ and $\|B_t\|_F$, is an approximation. In practice, when OAP relaxes strongly, $B_t$ can learn a stronger representation ($\|B_t\|_F$ grows) while CPI improves routing ($p_e$ shrinks). The two effects pull in opposite directions, so the product $p_e \cdot \|B_t\|_F$ changes little. The bound remains useful in an *order-of-magnitude* sense: forgetting ∝ $p_e \times (1-\beta_l)$, i.e. it is *gated* jointly by routing accuracy and the relaxation level.
+**Corollary** *(Sufficient condition for zero forgetting).* If CPI guarantees $p_e(s) \leq \delta$ for all $s$, then forgetting $\leq \delta \cdot (1-\beta_{\min}) \cdot M$, which is arbitrarily small when routing accuracy is high.
+**Important distinction between $\beta_{\min}$ and warmup:**
+- **$\beta_{\min}$ (hard provable floor)**: From Theorem 4, $\text{FT}(s) \leq p_e(s) \cdot (1-\beta_{\min}) \cdot M$ is a *mathematical bound* that holds, under the theorem's assumptions, for every task index and routing history. It is not a heuristic. Setting $\beta_{\min} = 0.3$ means forgetting is capped at $0.7 \cdot p_e \cdot M$, no matter how large the overlap $\rho_l$ is.
+- **Warmup ($\eta_{\text{eff}}(t) = \eta \cdot \min(1, (t-1)/T_{\text{warmup}})$) (empirical safeguard)**: There is *no* theoretical argument for a specific number $T_{\text{warmup}}$ of tasks. The motivation: in early tasks, CPI has not yet accumulated enough old $\tilde{C}_s$ → the contrastive signal is weak → the routing margin is lower → $p_e$ is higher → by the forgetting formula, a smaller relaxation should be used. Warmup implements this linearly. However, **even with warmup removed entirely, $\beta_{\min}$ still guarantees the forgetting bound**. Warmup only adds an empirical safety net; it does not replace $\beta_{\min}$.
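The schedule and the floor described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the function names are hypothetical, and `t_warmup=4` is an assumed value chosen only because it makes the relaxation reach full strength at task 5.

```python
def eta_eff(task_index: int, eta: float, t_warmup: int) -> float:
    """Linear warmup eta * min(1, (t - 1) / T_warmup); task indices start at 1."""
    return eta * min(1.0, (task_index - 1) / t_warmup)


def beta_l(rho_l: float, task_index: int, eta: float = 0.5,
           beta_min: float = 0.3, t_warmup: int = 4) -> float:
    """Relaxation coefficient max(beta_min, 1 - eta_eff(t) * rho_l)."""
    return max(beta_min, 1.0 - eta_eff(task_index, eta, t_warmup) * rho_l)


def forgetting_cap(p_e: float, beta: float, m: float = 1.0) -> float:
    """Upper bound p_e * (1 - beta) * M taken from Theorem 4."""
    return p_e * (1.0 - beta) * m
```

With these defaults, `beta_l` never drops below `beta_min`, so `forgetting_cap` is at most `0.7 * p_e * M` whatever the overlap or the warmup setting.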
+### 4.8 Theorem 5: OAP Learning Gain
+**Theorem 5** *(SSE Reduction → AP Gain).* With OAP ($\beta_l < 1$):
+$$E_{h \sim p_t}[\|A_t h\|^2] \geq \underbrace{(1-\beta_l)^2 \cdot \text{tr}(P_{\text{old}} C_t A_t^T A_t)}_{\text{shared variance (recovered by OAP)}} + \underbrace{\text{tr}(Q C_t Q A_t^T A_t)}_{\text{null-space variance (CPI)}}$$
+Strict InfLoRA ($\beta_l = 1$) has only the second term. OAP adds the shared variance → **directly increases the expected activation energy** → $B_t$ receives a stronger gradient signal → better learning → higher AP.
+**Grassmannian interpretation**: InfLoRA constrains $A_t \in \text{Gr}(r, \text{null}(P_{\text{old}}))$. OAP enlarges the search space to $\text{Gr}(r, \mathbb{R}^d)$ with a soft penalty → optimal directions (including shared ones) become accessible.
+### 4.9 Two-Phase Routing
+| Phase | Mechanism | Rationale |
+|-------|--------|-------|
+| **Training** | Oracle: weight=1.0 for the current task | Task ID is available |
+| **Inference** | Hard Top-1 calibrated A-row argmax | Task ID is unavailable |
+**Calibration**: $\alpha_t^{\text{cal}}(h) = \alpha_t(h)/\hat\mu_t$, $\hat\mu_t = \text{EMA}[\|A_th\|^2/(r\|h\|^2)]$.
+**Proposition 2** *(Drift-Free)*: $h$ comes from the frozen `embed_tokens` and $A_t$ is frozen → $\alpha_t(h)$ is invariant.
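As a rough illustration of inference-time routing, the sketch below scores each adapter by its calibrated affinity and picks the argmax. It is an assumption-laden toy: `affinity` and `route_top1` are hypothetical names (not the repository's `compute_spectral_routing()`), and adapters are plain lists of rows rather than tensors.

```python
def affinity(A, h):
    """alpha_t(h) = ||A h||^2 / (r * ||h||^2) for A given as r rows of length d."""
    r = len(A)
    proj_sq = sum(sum(a * x for a, x in zip(row, h)) ** 2 for row in A)
    return proj_sq / (r * sum(x * x for x in h))


def route_top1(adapters, mu_hat, h):
    """Index of the adapter with the largest calibrated score alpha_t(h) / mu_hat_t."""
    scores = [affinity(A, h) / mu for A, mu in zip(adapters, mu_hat)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Dividing by $\hat\mu_t$ matters: an adapter whose raw scores are systematically small can still win once its score is expressed relative to its own training-time scale.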
+---
+## 5. Contributions
+### Contribution 1: Parameter-Free Spectral Routing Framework (C1 + C2 + C3)
+Routing is fully parameter-free and follows from the protection-routing duality:
+- Routing margin ≥ κ_min(t*) − ε·κ_max (Theorem 1)
+- Routing drift = 0 by construction (Proposition 2)
+**C1** (Spectral Signatures): $A_t$ itself is the routing key.
+**C2** (Calibrated A-row Routing): hard Top-1 + EMA calibration.
+**C3** (Dynamic ESA Threshold).
+### Contribution 2: Contrastive Projected Initialization (CPI)
+Addresses Problem 1 (Same-domain Learning Collapse):
+$$A_t = \text{top-}r\text{ eigvecs}\bigl((I-\beta_l P_{\text{old}})\,C_t\,(I-\beta_l P_{\text{old}}) - \gamma\bar{C}_{<t}^{\,\mathrm{w}}\bigr)$$
+- Optimal discriminative init (Theorem 2), routing margin ≥ λ_min+(D_t)/r (Theorem 3)
+- γ=0: falls back to C5; γ>0: contrastive
+- Guidance for choosing γ: §3.3 (heuristic + adaptive fallback when too many eigenvalues are negative)
+- Storage: $\tilde{C}_s$ per task per layer (second-moment statistics, the same kind of information as GPM)
+### Contribution 3: Overlap-Aware Projection (OAP)
+Addresses Problem 2 (Shared Subspace Exclusion):
+$$A_t \leftarrow A_t(I - \beta_l \cdot P_{\text{old}}), \quad \beta_l = \max(\beta_{\min},\; 1 - \eta \cdot \rho_l)$$
+- Adaptive per-layer: high overlap → relax, low overlap → strict (auto-detect)
+- Forgetting bounded by $p_e \cdot (1-\beta_l) \cdot M$ (Theorem 4), gated by routing accuracy
+- AP gain proportional to the recovered shared variance (Theorem 5)
+- η=0: original InfLoRA; η>0: OAP
+- Protection while routing is still weak: warmup of η by task index + a high β_min for early tasks (§3.4.2)
+- Differences from TRGP: automatic per-layer β_l + CPI integration + hard routing gate (§3.4.1)
+**C4** (Gradient Preconditioning): preconditioner $(AA^\top+\epsilon I)^{-1/2}$.
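One way to form the C4 preconditioner $(AA^\top+\epsilon I)^{-1/2}$ is through a symmetric eigendecomposition of the small $r \times r$ Gram matrix. The sketch below uses NumPy and a hypothetical function name; it is not taken from `cl_trainer_specroute.py`, which may compute the same quantity differently.

```python
import numpy as np


def inv_sqrt_preconditioner(A: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Return (A A^T + eps I)^(-1/2) for an (r, d) matrix A."""
    gram = A @ A.T + eps * np.eye(A.shape[0])   # symmetric positive definite, (r, r)
    eigvals, eigvecs = np.linalg.eigh(gram)     # gram = U diag(w) U^T
    return (eigvecs / np.sqrt(eigvals)) @ eigvecs.T   # U diag(w^-1/2) U^T
```

A gradient with respect to `lora_B` of shape `(d_out, r)` would then be right-multiplied by this matrix, matching $\tilde\nabla_B = \nabla_B\mathcal{L} \cdot (AA^\top + \epsilon I)^{-1/2}$.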
+---
+## 6. Architecture and Changes
+### 6.1 Comparison with GainLoRA
+| Component | GainLoRA | SpecRoute | Change |
+|------------|----------|-----------|----------|
+| trans_input MLP | Learned routing | Removed | Duality |
+| prompt_key | Learned per-task | Removed | A_t = signature |
+| previous_trans_input | Frozen copies | Removed | Drift-free |
+| KL distillation | Replay routing loss | Removed | No learned routing |
+| Null-space projection | Hard (β=1) | **Relaxed (β_l adaptive)** | OAP |
+| — | — | **CPI init** | Discriminative subspace |
+| — | — | **OAP projection** | Shared knowledge transfer |
+| — | — | **C4 precond** | Null-space gradient fix |
+| — | — | **Stored $\tilde{C}_s$** | Cross-task contrastive |
+### 6.2 Pipeline
+**Task 1:**
+1. Load model + fresh LoRA (Kaiming/zeros)
+2. Train lora_B
+3. GPM update → save reg_{i}.pt
+4. Save $\tilde{C}_1$ → cov_{i}.pt
+**Task t ≥ 2:**
+1. Load model + frozen old LoRA adapters
+2. **[CPI+OAP]** Pre-task forward (100 batches):
+   - Collect $C_t$
+   - Load $\tilde{C}_1, ..., \tilde{C}_{t-1}$ → compute $\bar{C}_{\text{old}}$
+   - Compute $\rho_l = \text{tr}(P_{\text{old}} \cdot C_t)/\text{tr}(C_t)$ per layer
+   - $\beta_l = \max(\beta_{\min}, 1 - \eta_{\text{eff}}(t) \cdot \rho_l)$
+   - $Q_{\text{OAP}} = I - \beta_l P_{\text{old}}$
+   - $\tilde{C}_t^{\text{OAP}} = Q_{\text{OAP}} C_t Q_{\text{OAP}}$
+   - $D_t = \tilde{C}_t^{\text{OAP}} - \gamma \bar{C}_{\text{old}}$
+   - $A_t \leftarrow$ top-r eigvecs of $D_t$ (eigvals > 0; fallback Kaiming)
+   - **OAP projection**: $A_t \leftarrow A_t(I - \beta_l P_{\text{old}})$ (relaxed)
+3. [C4] Precompute preconditioner
+4. Train lora_B + oracle routing + EMA calibration
+5. GPM update → save reg_{i}.pt
+6. Save $\tilde{C}_t$ → cov_{i}.pt
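The CPI+OAP step of the pipeline above can be sketched for a single layer as follows. This is a NumPy illustration under stated assumptions, not the code in `get_reg_matrix()`: the function name and argument layout are hypothetical, covariances are assumed to be dense `(d, d)` arrays, and the fallback rows are plain Gaussian draws standing in for the Kaiming fallback.

```python
import numpy as np


def cpi_oap_init(C_t, P_old, C_old_bar, r, gamma=0.5, eta_eff=0.5,
                 beta_min=0.3, seed=0):
    """Return (A_t, rho_l, beta_l) for one layer."""
    d = C_t.shape[0]
    rho = float(np.trace(P_old @ C_t) / np.trace(C_t))   # overlap with the old subspace
    beta = max(beta_min, 1.0 - eta_eff * rho)             # relaxed projection strength
    Q = np.eye(d) - beta * P_old                          # Q_OAP
    D = Q @ C_t @ Q - gamma * C_old_bar                   # contrastive matrix D_t
    eigvals, eigvecs = np.linalg.eigh(0.5 * (D + D.T))    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:r]                 # indices of the top-r eigenvalues
    keep = [i for i in order if eigvals[i] > 0]           # keep positive directions only
    A = eigvecs[:, keep].T                                # (k, d) with k <= r
    if A.shape[0] < r:
        # Assumed stand-in for the Kaiming fallback: small random rows.
        rng = np.random.default_rng(seed)
        extra = rng.standard_normal((r - A.shape[0], d)) / np.sqrt(d)
        A = np.vstack([A, extra])
    return A @ Q, rho, beta                               # OAP projection of the rows
```

Setting `gamma=0` and `eta_eff=0` gives `beta = 1`, so the sketch reduces to a strict null-space projection of the top eigenvectors of $Q C_t Q$.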
+### 6.3 Mapping → Code
+| Theory | Code | File |
+|-----------|------|------|
+| CPI: $D_t$ | `get_reg_matrix()` | cl_trainer_specroute.py |
+| OAP: $\beta_l = \max(\beta_{\min}, 1 - \eta \rho_l)$ | `get_reg_matrix()` | cl_trainer_specroute.py |
+| $\rho_l = \text{tr}(PC_t)/\text{tr}(C_t)$ | `get_reg_matrix()` | cl_trainer_specroute.py |
+| Stored $\tilde{C}_s$ | cov_{i}.pt saved/loaded | cl_trainer_specroute.py |
+| γ, η, β_min | CLI args | run_t5.py |
+| A-row routing | `compute_spectral_routing()` | t5_specroute.py |
+---
+## 7. Experimental Setup
+| Item | Value |
+|----------|---------|
+| **Models** | flan-t5-small (60M), flan-t5-large (783M) |
+| **Benchmarks** | SuperNI (15 tasks, 4 orders), Long Sequence (15 tasks, 2 orders) |
+| **Metrics** | AP (↑), FT (↓) |
+| **LoRA** | r=8, Q+V projections |
+| **Routing** | Train: oracle; Inference: hard Top-1 calibrated A-row |
+| **CPI** | γ=0.5 (default), N_batch=100 |
+| **OAP** | η=0.5 (default), β_min=0.3 |
+| **C4** | use_preconditioning True, ε=1e-6 |
+| **ESA** | ε₀=0.995 (dynamic) |
+| **GPM repr.** | 200 batches |
+| **Baselines** | GainLoRA (ROOT), InfLoRA, O-LoRA |
+### 7.1 Ablation: Sweep (γ, η) grid
+| Configuration | γ | η | Meaning | Expectation |
+|-----------|---|---|---------|---------|
+| Original C5 (v10a) | 0 | 0 | No contrastive term, strict null-space | Low AP (baseline) |
+| CPI only | 0.5 | 0 | Discriminative init, strict null-space | AP improves through better routing |
+| OAP only | 0 | 0.5 | C5 init, relaxed null-space | AP improves through shared knowledge |
+| CPI+OAP (full) | 0.5 | 0.5 | Discriminative + shared | Highest AP (expected) |
+| CPI strong | 0.7 | 0.5 | Strong contrastive + sharing | Checks whether a high γ causes problems |
+| OAP strong | 0.5 | 0.8 | Strong sharing + discriminative | Checks whether a high η causes forgetting |
+### 7.2 Additional Measurements
+- **SSE before/after OAP**: Measure $\text{SSE}_t$ and $\text{SSE}_t^{\text{OAP}}$ per layer per task → confirms that OAP reduces SSE.
+- **Routing accuracy per task**: Measure $p_e(s)$ at inference → confirms that CPI improves routing.
+- **$\rho_l$ distribution**: Histogram of $\rho_l$ across layers → checks that OAP auto-adapts correctly (low for cross-domain, high for same-domain).
+- **Eigenvalue spectrum of $D_t$**: Number of positive vs. negative eigenvalues → checks whether γ produces too many negative ones.
 ---
+## 8. Known Limitations and Rebuttals
+### 8.1 Theorem 4 (Forgetting bound) depends on assumptions
+The proof of Theorem 4 rests on approximate assumptions: $(i)$ hard Top-1 routing, $(ii)$ $p_e$ and $\|B_t\|_F$ nearly independent. Assumption $(i)$ holds exactly by design. Assumption $(ii)$ is an approximation: in practice, $p_e$ (which depends on CPI/A_t init) and $\|B_t\|_F$ (which depends on training dynamics) may be indirectly correlated. Nevertheless, the bound remains useful in a *qualitative* sense: forgetting ∝ $p_e \times (1-\beta_l)$, which exposes **two control levers** (routing accuracy and relaxation level). A more detailed proof (beyond a sketch) is given in §4.7.
+**On $\beta_{\min}$**: This is a **hard provable floor** in Theorem 4: $\text{FT}(s) \leq p_e \cdot (1-\beta_{\min}) \cdot M$ is a mathematical bound, not a heuristic. The reviewer conceded this point (see the rebuttal in §8.3).
+**On warmup**: Warmup is an **empirical safeguard**: there is no theoretical justification for a specific number $T_{\text{warmup}}$ of tasks. It helps empirically but is not required theoretically (since $\beta_{\min}$ is already a hard bound).
+**Directions for improvement**: The bound could be tightened by bounding $\|B_t\|_F$ in terms of the training loss (Proposition in the Appendix), or by using a PAC-Bayes framework to obtain a probabilistic bound that does not rely on the independence assumption.
+### 8.2 CPI with high γ produces negative eigenvalues
+When $\gamma \to 1$, $D_t = \tilde{C}_t - \gamma\bar{C}_{<t}^{\,\mathrm{w}}$ may have mostly negative eigenvalues → fewer than $r$ eigenvectors with positive eigenvalues → the missing rows must fall back to Kaiming. This reduces the effectiveness of CPI.
+**Mitigations**:
+1. **Adaptive fallback** (implemented): if fewer than $r$ eigenvectors have positive eigenvalues, the missing rows are filled with random vectors in the null-space, so the initialization never fails outright.
+2. **Heuristic for γ**: $\gamma^* \approx 1 - r/\text{rank}(\tilde{C}_t)$. With $d=512$, $r=8$ (taking $\text{rank}(\tilde{C}_t) \approx d$): $\gamma^* \approx 0.98$ (generous), but in practice $\gamma \in [0.3, 0.7]$ is preferable because of noise amplification.
+3. **Ablation grid**: Sweep γ ∈ {0, 0.3, 0.5, 0.7} in the experiments (§7.1) to identify the stable range per benchmark.
+**Note**: The generality of CPI is not seriously affected because (a) the fallback always keeps it operational, and (b) the range $\gamma \in [0.3, 0.7]$ is wide and stable for most benchmarks.
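The γ heuristic and the recommended working range above amount to simple arithmetic. The helper names below are hypothetical, and the clamp bounds are just the $[0.3, 0.7]$ range quoted in this section.

```python
def gamma_heuristic(r: int, rank: int) -> float:
    """gamma* ~ 1 - r / rank(C_tilde_t)."""
    return 1.0 - r / rank


def clamp_gamma(gamma: float, low: float = 0.3, high: float = 0.7) -> float:
    """Restrict gamma to the range recommended in practice."""
    return min(high, max(low, gamma))
```

For `r=8` and a full-rank covariance with `rank=512`, `gamma_heuristic` gives 0.984375, which `clamp_gamma` pulls back to 0.7.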
+### 8.3 OAP increases forgetting when routing is still weak
+Relaxing $\beta_l$ improves learning, but if routing is poor (early tasks, CPI not yet strong enough) → forgetting grows as $p_e \cdot (1-\beta_l)$.
+**$\beta_{\min}$ is a hard limit, not a heuristic**: Even when $\rho_l = 1$ (maximum overlap), $\beta_l \geq \beta_{\min}$ always holds by the design $\beta_l = \max(\beta_{\min}, \ldots)$. Hence, by Theorem 4, $\text{FT}(s) \leq p_e \cdot (1-\beta_{\min}) \cdot M$ is a *provable* bound for every task and every routing history. The reviewer conceded this point: "β_min IS a provable bound". Mathematically, this is not a heuristic.
+**Safeguards** (already designed, details in §3.4.2):
+1. $\beta_{\min}$: **Hard provable floor**; it always keeps a minimum level of protection regardless of conditions.
+2. **Warmup of η by task index**: $\eta_{\text{eff}}(t) = \eta \cdot \min(1, (t-1)/T_{\text{warmup}})$. Task 1: $\eta_{\text{eff}} = 0$; task 2: $\eta_{\text{eff}} = \eta/T_{\text{warmup}}$ (still small). Task 5+: $\eta_{\text{eff}} = \eta$ (full OAP), which corresponds to $T_{\text{warmup}} = 4$. This is an **empirical safeguard**: it has no theoretical bound, but it reduces empirical risk in early tasks.
+3. **Adaptive $\beta_{\min}$**: a high β_min (0.7) for early tasks, decreasing (to 0.3) as routing stabilizes.
+4. **Auto-detection**: $\rho_l \approx 0$ for cross-domain → $\beta_l \approx 1$ → OAP deactivates itself.
+Combining the 4 safeguards → OAP relaxes only when: $(a)$ the warmup phase has passed, **AND** $(b)$ the overlap is genuinely high (same-domain), **AND** $(c)$ the $\beta_{\min}$ safety net (hard bound) is always kept.
+### 8.4 More hyperparameters
+CPI adds γ, OAP adds (η, β_min) → 3 new hyperparameters in total relative to the original InfLoRA.
+**Assessment**: This increase is acceptable because:
+1. **Stable defaults**: (γ=0.5, η=0.5, β_min=0.3) are expected to work well on most benchmarks (to be verified by ablation).
+2. **Safe fallback**: (γ=0, η=0) = original InfLoRA → never worse than the baseline.
+3. **Disentangled**: γ affects only the init, η affects only the projection → easy to tune independently.
+4. **Small grid**: the 4 main configurations (§7.1) suffice to locate a good region; no exhaustive search is needed.
+**To state in the paper**: Table 7.1 presents the full ablation grid, together with a sensitivity analysis for each hyperparameter.
+### 8.5 Novelty of OAP relative to TRGP
+OAP shares the idea of "relaxing the null-space for correlated tasks" with TRGP. This must be stated clearly in Related Work to avoid being seen as a "re-invention".
+**Core differences** (details in §3.4.1):
+1. **Per-layer automatic**: TRGP uses a task-level similarity heuristic; OAP uses $\rho_l$ per layer per chunk from spectral analysis.
+2. **Integrated routing gate**: TRGP has no routing → relaxation directly causes forgetting. OAP + hard Top-1 routing → forgetting is gated by $p_e$ (Theorem 4).
+3. **Combined with CPI**: TRGP is standalone. OAP + CPI create a **positive feedback loop in the cross-domain regime**: high routing accuracy → safe to relax β_l → better shared learning → better routing. In the same-domain regime: β_min bounds the worst case (no unbounded reinforcement).
+4. **LoRA-specific**: TRGP is designed for the full model. OAP acts only on $A_t$ (frozen init) → minimal interference.
+**Presentation in the paper**: Related Work names TRGP as an important predecessor, then compares in detail using the table in §3.4.1 to highlight the novelty.
+### 8.6 Other structural limitations
+**Capacity saturation.** $T_{\max} \leq d/(\bar{k}(1-\varepsilon))$. OAP mitigates this slightly (expanded search space) but does not fully solve it.
+**Same-domain routing remains harder than cross-domain.** CPI + OAP improve it considerably, but when $\tilde{C}_t \approx \bar{C}_{\text{old}}$, $D_t$ has small eigenvalues → the routing margin is lower than in the cross-domain case. This is a structural limitation that spectral methods alone cannot overcome.
+### 8.7 Hidden assumption: "AP gain from OAP > forgetting cost" is not proven theoretically
+**Hidden assumption**: The CPI+OAP framework implicitly assumes that $\Delta\text{AP}(\text{OAP gain}) \geq \text{FT\_cost}(\text{OAP relaxation})$ in the same-domain regime, i.e. net AP is positive after accounting for the increased forgetting.
+**This has not been proven mathematically.** Theorem 4 only bounds forgetting and Theorem 5 bounds the AP gain, but no theorem compares the two quantities directly. A general proof would require bounding $\|B_t\|_F$ in terms of the task gradient signal, then comparing the scaling of the AP gain (proportional to the recovered shared variance) with that of the forgetting cost (proportional to $p_e \times (1-\beta_l)$); this analysis depends strongly on distributional assumptions.
+**Main practical arguments**:
+1. **The right comparison baseline**: The original InfLoRA (v10a) is already completely broken on same-domain tasks (qqp=11.95, rte=10.11 vs. root 76.96, 45.85): an SSE of 70-90% destroys learning. OAP does not need to beat the theory; it only needs to do better than a baseline that has already failed. This is the right comparison point.
+2. **Empirical verification**: The OAP-only ablation (γ=0, η=0.5) on the Long Order3/SuperNI benchmarks will confirm (or refute) the net-positive assumption. This is an empirical observation, not theory.
+3. **Conservative fallback**: η=0 → original InfLoRA. If OAP hurts net AP on any benchmark, the recommendation is to keep η small for that benchmark.
+**Semi-formal sufficient condition from Theorems 4 + 5**: From the two bounds:
+- Theorem 4: $\text{FT}(s) \leq p_e \cdot (1-\beta_l) \cdot M$
+- Theorem 5: $\Delta E[\|A_t h\|^2] \geq (1-\beta_l)^2 \cdot \text{tr}(P_{\text{old}} C_t A_t^\top A_t)$
+The AP_gain/FT_cost ratio scales as $(1-\beta_l) \cdot \text{SSE} \cdot \text{tr}(C_t) / p_e$. **Sufficient condition for net-positive**:
+$$\text{SSE}_t \cdot (1-\beta_l) > \frac{p_e \cdot c}{\text{tr}(C_t)}$$
+where $c$ is an output-scale constant. For same-domain tasks ($\text{SSE} \approx 0.7$-$0.9$) with CPI keeping $p_e$ small, the condition is satisfied by a wide margin; this is a *semi-formal* argument, not merely an empirical observation. What is still missing is an absolute bound on $c$ and $p_e$ in terms of the hyperparameters, and that is the part that genuinely remains future work.
+**Recommendation for the paper**: "Theorems 4 and 5 provide a semi-formal sufficient condition for net-positive AP; the empirical ablation verifies this condition on every benchmark tested."
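The semi-formal net-positive condition $\text{SSE}_t \cdot (1-\beta_l) > p_e \cdot c / \text{tr}(C_t)$ is easy to evaluate once its inputs have been measured. The function below is only a sketch with a hypothetical name; the constant `c` is unknown in general, so the default of 1.0 is a placeholder assumption.

```python
def oap_net_positive(sse: float, beta_l: float, p_e: float,
                     trace_c_t: float, c: float = 1.0) -> bool:
    """Check SSE * (1 - beta_l) > p_e * c / tr(C_t)."""
    return sse * (1.0 - beta_l) > p_e * c / trace_c_t
```

Note that strict projection (`beta_l = 1`) makes the left-hand side zero, so the condition can only hold when OAP actually relaxes.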

improve_gainlora/IDEA_Overall.md.bak ADDED Viewed

	@@ -0,0 +1,337 @@

+# SpecRoute: Data-Driven Spectral Routing for Continual Learning with LoRA
+> **Official design document.** Constraints: strict zero-replay. Theory-first.
+---
+# PART I: UNDERSTANDING THE PROBLEM
+---
+## 1. Problem and Constraints
+### 1.1 The big picture
+Continual learning with large language models (LLMs) requires the model to acquire $T$ heterogeneous tasks sequentially (sentiment classification, language inference, question answering) while data from earlier tasks is not stored. This setting belongs to the **continual learning (CL)** family under a strict **zero-replay** constraint.
+The prevailing paradigm: for each task $t$, add a **LoRA adapter** $\Delta W_t = B_t A_t$ to the frozen base model; after training, freeze the adapter and initialize a new one for task $t+1$.
+After $T$ tasks the model holds $T$ parallel adapters. At inference time the task identity is not provided; the model must decide on its own which adapter to activate:
+$$y = f\!\Bigl(W_0\, x \;+\; \sum_{t=1}^{T} w_t(x)\; B_t A_t\, x\Bigr)$$
+The problem has **three intertwined challenges**:
+| Challenge | Question raised |
+|:----------:|--------------|
+| **Routing** | How do we know which task an input belongs to, so that the right adapter is chosen? |
+| **Protection** | How do we ensure old adapters are not damaged when a new task is learned? |
+| **Allocation** | How do we manage the "subspace budget" across T adapters? |
+### 1.2 Strict constraints
+- **Zero-replay**: Data from old tasks must never be reused, whether as raw data, synthetic data, or summary statistics of old data (means, prototypes).
+- **No task ID at inference**: The model must route on its own without knowing which task is being queried.
+- **InfLoRA constraint**: $A_t$ is frozen after initialization; only $B_t$ is trained.
+---
+## 2. Baseline and Problems
+### 2.1 What does GainLoRA do?
+GainLoRA (NeurIPS 2025) addresses the three challenges with three separate mechanisms:
+**Routing** is implemented with two learned components:
+- **`trans_input`**: a two-layer MLP ($d_\text{model} \to d_\text{hidden} \to d_\text{model}$, SiLU activation) that maps the mean sequence embedding to a *query vector* $q \in \mathbb{R}^{d}$.
+- **`prompt_key`**: a learned per-task parameter vector $k_t \in \mathbb{R}^{d}$ that serves as the *routing key*. At inference, the cosine similarity between the query $q$ and every key $\{k_1, \ldots, k_T\}$ is computed (the current task's key plus the old keys stored frozen in `previous_prompts_keys`); the result passes through a sigmoid to produce a gating weight for each adapter.
+**Protection** is implemented with **GPM (Gradient Projection Memory)**: after each task, GPM collects the activation covariance matrix at the attention layers, computes its SVD, and stores a basis $U_t \in \mathbb{R}^{d \times r_t}$ spanning the subspace of task $t$. When a new task is learned, every gradient update to `lora_A` and to `trans_input` itself is **projected onto the null-space** of the old bases ($\Delta W \leftarrow \Delta W - UU^\top \Delta W$), which prevents overwriting old knowledge.
+**Allocation** is implemented with a dynamic threshold: the energy ratio $\varepsilon_t$ grows with the number of tasks and decides how many singular vectors are kept in the GPM basis; later tasks are allowed to occupy less subspace than earlier ones.
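The GPM protection step $\Delta W \leftarrow \Delta W - UU^\top \Delta W$ can be checked numerically with a small NumPy sketch. The function name is hypothetical, and the convention that the basis acts on the first axis of the update is an assumption made for illustration; the actual layout in GainLoRA's code may differ.

```python
import numpy as np


def project_to_null_space(delta_w: np.ndarray, U: np.ndarray) -> np.ndarray:
    """Remove the component of delta_w (d x m) lying in span(U), U being d x k orthonormal."""
    return delta_w - U @ (U.T @ delta_w)


rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((6, 2)))   # orthonormal basis of a 2-dim "old" subspace
delta_w = rng.standard_normal((6, 3))              # a stand-in gradient update
projected = project_to_null_space(delta_w, U)
```

After the projection, `U.T @ projected` is numerically zero, which is exactly the statement that the update no longer touches the protected directions.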
+However, this architecture contains an internal contradiction.
+### 2.2 Structural Contradiction
+GainLoRA's routing **depends on learned parameters** (`trans_input` and `prompt_key`), which leads to an inescapable loop: when `trans_input` is updated while learning a new task, the query space shifts and loses alignment with the old, frozen `prompt_key` vectors. To compensate, GPM must also protect `trans_input`, but this consumes subspace intended for task learning, so the new task is learned less well, which in turn makes `trans_input` change more, creating the loop:
+$$\underbrace{\texttt{trans\_input}\ \text{drift}}_{\text{new task}} \;\to\; \underbrace{\text{prompt\_key misalign}}_{\text{poor routing}} \;\to\; \underbrace{\text{GPM protects routing}}_{\text{costs subspace}} \;\to\; \underbrace{\text{worse task learning}}_{\text{more drift again}} \;\to\; \cdots$$
+Moreover, to stabilize routing while `trans_input` changes, GainLoRA uses **KL distillation**: it stores the routing-logit distribution $\{p_s\}$ on old-task data and then minimizes $D_\text{KL}(p_s^\text{stored} \| p_s^\text{current})$. This means storing and reusing statistics that depend on old data, which comes close to violating the zero-replay constraint.
+### 2.3 Research Question
+The loop above raises a basic question: *does routing really need to be learned, or does the subspace protection mechanism already implicitly provide a routing signal without additional parameters?* This question is the starting point for SpecRoute's main contribution.
+---
+## 3. Idea (High-Level)
+### 3.1 Key observation: protection and routing are two sides of one structure
+GPM ensures that the adapters of different tasks occupy **nearly orthogonal subspaces** of the input space; this is the protection mechanism against catastrophic forgetting.
+Yet this same orthogonality also yields a natural routing signal: if adapter $t$ operates on the subspace $\mathrm{span}(V_t)$ and adapter $s$ on $\mathrm{span}(V_s)$ with $V_t \perp V_s$, then an input $h_t$ characteristic of task $t$ has high alignment with $V_t$ and nearly zero alignment with $V_s$. Measuring the alignment of the input with each adapter's subspace is therefore sufficient for routing, with no extra learned parameters.
+> **Routing–Protection Duality**: Preventing catastrophic forgetting and identifying the task both stem from the same subspace-orthogonality structure; the two goals do not conflict but reinforce each other.
+### 3.2 The paradox of random initialization
+There is, however, a paradox: InfLoRA initializes $A_t$ **randomly** before projecting it onto the null-space. A random initialization carries no information about the data distribution of task $t$; it merely occupies an arbitrary direction in the available space. As a result, the affinity scores of all adapters for any given input are approximately equal in expectation, which cancels the routing signal.
+This is the **GPM–Routing Paradox**: GPM ensures subspace orthogonality (a necessary condition for good routing), yet random initialization destroys the routing signal at the very first step, before training begins.
+### 3.3 Solution: data-driven subspace initialization
+The paradox is resolved by replacing random initialization with the solution of an optimization problem: *which directions in the available space (the null-space) best match the data distribution of task $t$?*
+The closed-form solution is the **top-$r$ eigenvectors of the task-$t$ activation covariance projected onto the null-space**: before training task $t$, a short forward pass over the current data (no parameter updates) estimates the covariance $C_t$; $A_t$ is then set to the principal eigenvectors of $Q_{t-1}C_tQ_{t-1}$.
+This is **C5: Data-Informed Subspace Initialization**. Theoretically, C5 resolves the paradox by ensuring that $A_t$ carries a task-$t$-specific signal from initialization, while also maximizing the variance captured in the null-space so that $B_t$ learns more effectively.
+### 3.4 Interaction between the two contributions
+The two contributions are not independent: they form a mutually reinforcing loop. Contribution 1 (Duality) establishes that the routing margin grows with $\kappa_{\min}(t^*)$, the spectral quality of the expert. Contribution 2 (C5) maximizes $E[\alpha_t(h)]$ from initialization, so $B_t$ learns larger singular values, $\kappa_{\min}(t^*)$ increases, and the routing margin improves, which directly strengthens the quantitative guarantee of Theorem 1. C5 not only resolves the paradox but also **strengthens** the whole theoretical framework.
+---
+# PART II: TECHNICAL DETAILS
+---
+## 4. Theory and Proofs
+### 4.1 Spectral Signature and Affinity
+**Definition 1** *(Spectral Signature).* For a frozen expert $\Delta W_t = B_t A_t$ with thin SVD $\Delta W_t = U_t \Sigma_t V_t^\top$, the **spectral signature** is $\mathcal{S}_t = (V_t, \boldsymbol{\sigma}_t)$, where:
+- $V_t \in \mathbb{R}^{d \times r}$: **input receptive field**, the $r$ input directions the expert processes.
+- $\boldsymbol{\sigma}_t$: **sensitivity spectrum**, the amplification factor along each direction.
+**Definition 2** *(Spectral Affinity).* The affinity of input $h$ with expert $t$:
+$$\alpha_t(h) = \frac{\sum_{i=1}^{r} \sigma_{t,i}^2\,(v_{t,i}^\top h)^2}{\bigl(\sum_{i=1}^{r} \sigma_{t,i}^2\bigr)\,\|h\|^2} = \frac{\|\Delta W_t\, h\|^2}{\|\Delta W_t\|_F^2\,\|h\|^2} \;\in\; [0,1]$$
+**Interpretation**: $\alpha_t(h)$ is the fraction of expert $t$'s "channel capacity" activated by $h$. If $h \in \text{span}(V_t)$ then $\alpha_t(h) \geq \kappa_{\min}(t) > 0$; if $h \perp \text{span}(V_t)$ then $\alpha_t(h) = 0$ exactly.
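The two expressions for $\alpha_t(h)$ in Definition 2 (the SVD form and the direct form $\|\Delta W_t h\|^2 / (\|\Delta W_t\|_F^2 \|h\|^2)$) can be compared numerically. The NumPy sketch below uses hypothetical function names and random matrices purely as a sanity check.

```python
import numpy as np


def spectral_affinity_svd(B: np.ndarray, A: np.ndarray, h: np.ndarray) -> float:
    """Affinity from the thin SVD of dW = B A: sum_i s_i^2 (v_i . h)^2 / (sum_i s_i^2 ||h||^2)."""
    _, s, Vt = np.linalg.svd(B @ A, full_matrices=False)
    r = A.shape[0]
    s, Vt = s[:r], Vt[:r]                       # keep the r directions of the rank-r expert
    return float(np.sum(s**2 * (Vt @ h) ** 2) / (np.sum(s**2) * (h @ h)))


def spectral_affinity_direct(B: np.ndarray, A: np.ndarray, h: np.ndarray) -> float:
    """Affinity as ||dW h||^2 / (||dW||_F^2 ||h||^2)."""
    dW = B @ A
    return float(np.sum((dW @ h) ** 2) / (np.sum(dW**2) * (h @ h)))
```

Both functions should agree up to floating-point error and return a value in $[0, 1]$.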
+---
+### 4.2 Theorem 1: Routing–Protection Duality
+**Definition 3** *(Subspace Overlap).* $\delta_{ij} = \|V_i^\top V_j\|_F^2 = \sum_k \cos^2\theta_{ij}^{(k)}$ where $\theta_{ij}^{(k)}$ are the principal angles.
+**Theorem 1** *(Routing–Protection Duality).* If GPM guarantees $\delta_{ij} \leq \varepsilon$ $\forall i \neq j$, then for a unit input $h \in \mathrm{span}(V_{t^*})$:
+$$\boxed{\alpha_{t^*}(h) - \max_{t \neq t^*} \alpha_t(h) \;\geq\; \underbrace{\kappa_{\min}(t^*)}_{\text{expert quality}} - \underbrace{\varepsilon\,\kappa_{\max}}_{\text{overlap noise}}}$$
+where $\kappa_{\min}(t) = \sigma_{t,\min}^2 / \sum_i \sigma_{t,i}^2$, and $\kappa_{\max}$ is the analogous ratio with $\sigma_{t,\max}^2$, taken as the maximum over $t \neq t^*$.
+**Proof.** *Lower bound:* Write $h = V_{t^*}c$, $\|c\|=1$ → $\alpha_{t^*}(h) = \sum_i \sigma_{t^*,i}^2 c_i^2 / \sum_i \sigma_{t^*,i}^2 \geq \kappa_{\min}(t^*)$.
+*Upper bound for a wrong expert:* $\|V_t^\top h\|^2 \leq \delta_{t,t^*} \leq \varepsilon$ → $\alpha_t(h) \leq \kappa_{\max}\varepsilon$. $\square$
+**Corollary 1** *(Routing Confidence).* With softmax temperature $\tau$:
+$$w_{t^*}(h) \geq \frac{1}{1 + (T-1)e^{-m/\tau}}, \quad m = \kappa_{\min}(t^*) - \varepsilon\kappa_{\max}$$
+**Corollary 2** *(Capacity Bound).* $T_{\max} \leq d / (\bar{k}(1-\varepsilon))$ where $\bar{k}$ is the average GPM effective rank per task. With T5-small ($d=512$) and $\bar{k} \approx 50$: $T_{\max} \approx 10$ at a strict threshold; null-space saturation is a practical risk, not just a theoretical one.
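The two corollaries reduce to closed-form expressions that are simple to evaluate. The helper names below are hypothetical; the formulas are copied from Corollary 1 and Corollary 2.

```python
import math


def routing_confidence_lower_bound(margin: float, num_tasks: int, tau: float) -> float:
    """Corollary 1: w_{t*}(h) >= 1 / (1 + (T - 1) * exp(-m / tau))."""
    return 1.0 / (1.0 + (num_tasks - 1) * math.exp(-margin / tau))


def capacity_bound(d: int, k_bar: float, eps: float) -> float:
    """Corollary 2: T_max <= d / (k_bar * (1 - eps))."""
    return d / (k_bar * (1.0 - eps))
```

With `d=512`, `k_bar=50` and `eps` close to 0, `capacity_bound` gives about 10.24, in line with the $T_{\max} \approx 10$ figure quoted for T5-small.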
+---
+### 4.3 Proposition: InfLoRA satisfies the condition of Theorem 1
+Reviewers often object: *"GPM only projects gradients; it does not guarantee orthogonal subspaces."* This is true of GPM gradient projection, but InfLoRA has a stronger mechanism: **direct A-projection**.
+**Proposition 1** *(InfLoRA Orthogonality).* InfLoRA projects every row of $A_t$ onto the null-space of $P_{\text{old}} = \mathcal{B}\mathcal{B}^T$:
+$$A_t \leftarrow A_t(I - P_{\text{old}}) \;\Rightarrow\; \text{rowspace}(A_t) \subseteq \text{null}(P_{\text{old}})$$
+Since $\text{rowspace}(B_tA_t) \subseteq \text{rowspace}(A_t)$ (left multiplication does not enlarge the row space), it follows that:
+$$\text{span}(V_t) \subseteq \text{null}(P_{\text{old}}) \approx \perp \text{span}(V_s) \;\forall s < t \;\square$$
+**Approximation quality (Davis–Kahan).** GPM captures the principal directions approximately, with error bounded by:
+$$\sin\!\bigl(\Theta(\hat V_s, V_s)\bigr) \leq \|\hat C_s - C_s\|_2 / \delta_{\text{gap}}(C_s)$$
+The practical margin is reduced by a further $O(\|\Delta P\|_F/\delta_{\text{gap}})$: small for divergent tasks, larger for same-domain tasks (a fundamental limitation of any zero-replay CL method).
+---
+### 4.4 Lemma 1: Differential Projection
+**Lemma 1** *(Exact).* Given $A_t P_{t-1} = 0$ (InfLoRA constraint), for every $h \in \mathbb{R}^d$:
+$$\|A_t h\|^2 = \|A_t Q_{t-1} h\|^2, \quad Q_{t-1} = I - P_{t-1}$$
+**Proof.** $A_t h = A_t(P_{t-1}h + Q_{t-1}h) = 0 + A_t Q_{t-1}h$. $\square$
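Lemma 1 can be verified numerically on random data. The NumPy sketch below builds an orthogonal projector $P$, forces $A P = 0$ by projecting a random matrix, and compares both sides of the identity; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 3, 2

U, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal basis of the old subspace
P = U @ U.T                                        # projector P_{t-1}
Q = np.eye(d) - P                                  # complementary projector Q_{t-1}

A = rng.standard_normal((r, d)) @ Q                # enforce the InfLoRA constraint A P = 0
h = rng.standard_normal(d)

lhs = float(np.sum((A @ h) ** 2))                  # ||A h||^2
rhs = float(np.sum((A @ Q @ h) ** 2))              # ||A Q h||^2
```

The two values coincide because the component $P h$ is annihilated by $A$, which is the whole content of the proof.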
+**Corollary A** *(Current expert → old data is rejected naturally).* For $h \sim p_s$ ($s < t$), GPM captures ≥99.5% of the variance of $p_s$ → $\|Q_{t-1}h\|^2 \leq 0.005\,\text{tr}(C_s)$:
+$$E_{h \sim p_s}[\alpha_t(h)] \leq \frac{0.005\,\text{tr}(C_s)}{r}$$
+**Corollary B** *(Old expert → new data is rejected naturally).* Since $\text{rowspace}(A_s) \subseteq \text{range}(P_{t-1})$ (GPM-captured), it follows that $A_s Q_{t-1} = 0$, i.e. $\alpha_s(h_t)$ depends only on $P_{t-1}h_t$, the part of the task-$t$ variance explained by the old subspace. For cross-domain tasks: $\text{PEV}_{t,\text{old}} \ll 1$ → routing rejects naturally.
+---
+### 4.5 Theorem 2: C5 Routing Optimality
+**Definition 4** *(Restricted Stiefel Manifold).* $\mathcal{A}_t = \{A \in \mathbb{R}^{r \times d} : AP_{t-1} = 0,\; AA^\top = I_r\}$.
+**Theorem 2** *(C5 is the Optimal Routing Key).* With $\tilde{C}_t = Q_{t-1} C_t Q_{t-1}$:
+$$\operatorname{argmax}_{A_t \in \mathcal{A}_t} E_{h \sim p_t}[\alpha_t(h)] = \text{top-}r\text{ eigenvectors of } \tilde{C}_t$$
+**Proof.** From Lemma 1: $E[\alpha_t(h)] = \text{tr}(A_t \tilde{C}_t A_t^\top)/r$. Under the constraint $A_tA_t^\top = I_r$, this is standard constrained PCA → the solution is the top eigenvectors. $\square$
+**Dual significance:** C5 simultaneously optimizes (1) **routing**, by maximizing $E[\alpha_t(h)]$, and (2) **learning**, by maximizing the variance captured in the null-space → $B_t$ learns more effectively. One initialization, two objectives.
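Theorem 2 can be illustrated with a small NumPy sketch: take the top-$r$ eigenvectors of $\tilde{C}_t = Q C_t Q$ and compare the captured variance $\text{tr}(A \tilde{C}_t A^\top)$ with that of any other orthonormal frame in the null-space. The function names are hypothetical, and the sketch assumes $\tilde{C}_t$ has at least $r$ strictly positive eigenvalues.

```python
import numpy as np


def c5_init(C_t: np.ndarray, P_prev: np.ndarray, r: int):
    """Return (A_t, C_tilde) with A_t the top-r eigenvectors of Q C_t Q as rows."""
    d = C_t.shape[0]
    Q = np.eye(d) - P_prev
    C_tilde = Q @ C_t @ Q
    eigvals, eigvecs = np.linalg.eigh(0.5 * (C_tilde + C_tilde.T))  # ascending order
    A = eigvecs[:, ::-1][:, :r].T                                   # largest eigenvalues first
    return A, C_tilde


def captured_variance(A: np.ndarray, C_tilde: np.ndarray) -> float:
    """tr(A C_tilde A^T), proportional to E[alpha_t(h)] by Lemma 1."""
    return float(np.trace(A @ C_tilde @ A.T))
```

Because eigenvectors with positive eigenvalue of $Q C_t Q$ lie in the range of $Q$, the rows returned by `c5_init` satisfy $A P_{t-1} = 0$ without an extra projection step.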
+---
+### 4.6 Theorem 3: Explicit Routing Margin
+**Theorem 3.** Let $\lambda_{\min}(\tilde{C}_t) = \lambda_r(\tilde{C}_t)$ be the smallest of the top-$r$ eigenvalues of $\tilde{C}_t$. With C5 init and A-row routing:
+$$\boxed{E_{h \sim p_t}[\alpha_t(h)] - \max_{s < t} E_{h \sim p_s}[\alpha_t(h)] \geq \frac{\lambda_{\min}(\tilde{C}_t)}{r} - \frac{0.005\,\bar\sigma^2}{r}}$$
+**Advantage of C5 over random init:**
+$$\frac{E[\alpha_t^\text{C5}]}{E[\alpha_t^\text{rand}]} = \underbrace{\frac{d'}{r}}_{\text{null-space factor}} \cdot \underbrace{\text{PEV}_r(\tilde{C}_t)}_{\text{task concentration}}$$
+For T5-small at task 8 ($d' \approx 351$, $r=8$): the ratio is $\approx 44 \times \text{PEV}_8 \gg 1$.
+Notably, the factor $d'/r$ **increases** as the null-space narrows in later tasks: C5 has its largest effect precisely when routing becomes hardest.
+---
+### 4.7 Two-Phase Routing
+| Phase | Mechanism | Rationale |
+|-------|--------|-------|
+| **Training** (task $t$) | Oracle: weight=1.0 for the current task | $A_t$ lies in the null-space → $\|A_t h_t\|^2 \approx 0$ → spectral routing would kill the gradient |
+| **Inference** | Hard Top-1 calibrated A-row argmax | No task ID; uses calibrated spectral affinity |
+Using oracle routing during training is standard practice in CL: the task identity is always available at training time.
+**Calibration Normalization** at inference: The first task has $A_t$ in the full $d$-dimensional space; later tasks are constrained to a narrower null-space → raw scores are not comparable. An EMA scale is collected during training:
+$$\alpha_t^{\text{cal}}(h) = \frac{\alpha_t(h)}{\hat\mu_t}, \quad \hat\mu_t = \text{EMA}\!\left[\frac{\|A_t h\|^2}{r\|h\|^2}\right]_{\text{training of } t}$$
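A running estimate of $\hat\mu_t$ could be maintained with a standard exponential moving average, as sketched below. The function name, the initialization from the first sample, and the `momentum=0.99` default are all assumptions; the document does not specify how the EMA is parameterized.

```python
def ema_update(mu_hat, a_h_sq_norm: float, h_sq_norm: float, r: int,
               momentum: float = 0.99) -> float:
    """One EMA step on the sample ||A_t h||^2 / (r * ||h||^2)."""
    sample = a_h_sq_norm / (r * h_sq_norm)
    if mu_hat is None:              # first batch: start from the observed value
        return sample
    return momentum * mu_hat + (1.0 - momentum) * sample
```

At inference the stored value is only read, never updated, so the calibrated score $\alpha_t(h)/\hat\mu_t$ stays fixed for a given input.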
+**Drift-Free Guarantee (Proposition 2).** $h$ comes from the frozen `embed_tokens`; $A_t$ is frozen after C5 init → $\alpha_t(h)$ is fully invariant. No frozen copies and no distillation are needed.
+---
+## 5. Contributions
+### Contribution 1: Parameter-Free Spectral Routing Framework (C1 + C2 + C3)
+**Core claim**: Subspace protection and discriminative routing are **dual manifestations of the same spectral structure**. Routing need not be learned; good protection automatically yields good routing.
+From this duality we derive a fully parameter-free routing scheme with theoretical guarantees:
+- Routing margin $\geq \kappa_{\min}(t^*) - \varepsilon\kappa_{\max}$ (Theorem 1).
+- Confidence $w_{t^*} \geq 1-\delta$ with explicit $\tau$ (Corollary 1).
+- Capacity bound $T_{\max} \leq d/(\bar{k}(1-\varepsilon))$ (Corollary 2).
+- Routing drift = 0 by construction (Proposition 2).
+**Remark**: GainLoRA solves an unnecessary problem: learning routing parameters is redundant because the subspace protection mechanism already implicitly provides the routing signal. SpecRoute exploits this duality to remove the overhead of learned routing entirely.
+---
+**C1: Spectral Expert Signatures**: The signature is $A_t$ itself (a model parameter); no post-training SVD is needed. Reason: $\text{rowspace}(B_tA_t) \subseteq \text{rowspace}(A_t)$ → the SVD of $B_tA_t$ only adds a $\sigma$-weighting that is an artifact of $B_t$, with no theoretical guarantee. With C5 init, the rows of $A_t$ **are already** the routing-optimal directions (Theorem 2). This removes an $O(dr^2)$ overhead per task per layer.
+**C2: Data-Informed Differential Routing**: A unified routing formula for both training and inference:
+$$t^* = \arg\max_t\, \alpha_t^{\text{cal}}(h) = \arg\max_t \frac{\|A_t h\|^2 / r\|h\|^2}{\hat\mu_t}$$
+Training uses the oracle; inference uses the hard Top-1 calibrated argmax. A-row optimality follows from Lemma 1 + Theorem 2. This removes `prepare_inference_routing()` entirely.
+**C3: Capacity-Aware Subspace Allocation**: Dynamic threshold:
+$$\varepsilon_t = (1-\varepsilon_0)\cdot\frac{t}{T} + \varepsilon_0$$
+Bảo vệ nghiêm ngặt dần khi task tích luỹ. Đánh đổi có nguyên tắc qua Hệ quả 2.
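As a quick numeric check of the C3 schedule, the threshold formula can be evaluated directly. The function name `esa_threshold` is illustrative, and whether the task index `t` starts at 0 or 1 in the actual trainer is an assumption here.

```python
def esa_threshold(t: int, total_tasks: int, eps0: float = 0.995) -> float:
    """Dynamic threshold eps_t = (1 - eps0) * t / T + eps0 (eps0 = 0.995 as in the setup table)."""
    return (1.0 - eps0) * t / total_tasks + eps0


# Schedule over a 15-task sequence: starts near eps0 and reaches 1.0 at t = T.
schedule = [esa_threshold(t, 15) for t in range(1, 16)]
print([round(e, 5) for e in schedule])
```

With $\varepsilon_0 = 0.995$ the schedule only moves within the last half percent, so the tightening is gradual rather than abrupt.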
+---
+### Contribution 2: Data-Informed Subspace Initialization (C5)
+**Core claim**: Initializing $A_t$ solves a Constrained PCA problem on the restricted Grassmannian. The solution is closed-form, optimal for both routing and learning, and zero-replay compliant.
+$$\max_{A_t \in \mathcal{A}_t}\;\text{tr}(A_t \tilde{C}_t A_t^\top) \quad\Rightarrow\quad A_t = \text{top-}r\text{ eigenvectors of } \tilde{C}_t = Q_{t-1}C_tQ_{t-1}$$
+**Why this is a genuine contribution**: It is not merely a "better init". C5 is the solution of a constrained optimization problem on the Grassmannian: it has a closed form, an optimality proof, and a direct link to the routing theory via Theorem 3. Better init → $B_t$ learns better → larger $\sigma_{t,i}$ → larger $\kappa_{\min}(t^*)$ → larger routing margin.
+**Zero-replay compliance**: $C_t$ is computed from activations of the *current* task's data, which is always available under the CL protocol and is not replay. It is the same kind of information as the GPM bases (second moments of activations), which the community already accepts.
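A minimal NumPy sketch of the C5 initialization follows. It assumes that $Q_{t-1}$ is the orthogonal projector $I - UU^\top$ onto the complement of the protected subspace (with `U_old` an orthonormal basis); the function name `c5_init`, that construction of $Q$, and the use of NumPy instead of `torch.linalg.eigh` are choices made for this illustration, not the trainer's code.

```python
import numpy as np


def c5_init(H, U_old, r, tol=1e-6):
    """Return A_t of shape (r, d): top-r eigenvectors of Q C Q, or None to signal fallback.

    H:     (N, d) activations of the current task.
    U_old: (d, k) orthonormal basis of the subspace protected for old tasks (assumed form).
    """
    n, d = H.shape
    C = H.T @ H / n                          # C_t = sum_x h(x) h(x)^T / N
    Q = np.eye(d) - U_old @ U_old.T          # assumed null-space projector Q_{t-1}
    C_tilde = Q @ C @ Q
    C_tilde = 0.5 * (C_tilde + C_tilde.T)    # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(C_tilde)   # eigenvalues in ascending order
    if eigvals[-1] < tol:
        return None                          # caller falls back to Kaiming init
    top = eigvecs[:, ::-1][:, :r]            # r leading eigenvectors as columns
    return top.T


rng = np.random.default_rng(0)
H = rng.normal(size=(200, 6))
U_old = np.eye(6)[:, :2]                     # toy protected subspace: first two axes
A_t = c5_init(H, U_old, r=2)
print(A_t.shape)
```

The rows returned this way are orthonormal and orthogonal to `U_old`, so the new expert starts inside the allowed null-space while capturing as much of the current task's activation energy as possible.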
+**C4 (Gradient Preconditioning)** *(implementation detail)*: After $A_t$ is projected into the null-space, its column space is no longer orthogonal → the gradient $\nabla_B\mathcal{L}$ is distorted. The preconditioner is computed once after `get_reg_matrix()` and applied as:
+$$\tilde\nabla_B = \nabla_B\mathcal{L} \cdot (AA^T + \epsilon I)^{-1/2}$$
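The inverse square root in this formula can be obtained from a symmetric eigendecomposition. The sketch below uses NumPy and illustrative function names; the actual `precompute_preconditioners()` may differ in shapes and details.

```python
import numpy as np


def inv_sqrt_preconditioner(A, eps=1e-6):
    """Return (A A^T + eps I)^(-1/2) for A of shape (r, d)."""
    r = A.shape[0]
    gram = A @ A.T + eps * np.eye(r)
    eigvals, eigvecs = np.linalg.eigh(gram)          # gram is symmetric positive definite
    # V diag(1/sqrt(lambda)) V^T: scale each eigenvector column, then recombine.
    return (eigvecs / np.sqrt(eigvals)) @ eigvecs.T


def precondition(grad_B, P):
    """Right-multiply the gradient of B (shape (d_out, r)) by the preconditioner."""
    return grad_B @ P


rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))                          # toy, non-orthonormal rows
P = inv_sqrt_preconditioner(A)
gram = A @ A.T + 1e-6 * np.eye(3)
print(np.allclose(P @ gram @ P, np.eye(3), atol=1e-8))
```

Because $A_t$ is frozen during training, the matrix only needs to be computed once per task and layer, which matches the "apply once" remark above.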
+---
+## 6. Architecture and Technical Changes
+### 6.1 Comparison with GainLoRA
+| Component | GainLoRA | SpecRoute | Reason for change |
+|------------|----------|-----------|----------------|
+| MLP `trans_input` | Learned routing projection | ❌ Removed | Duality: A-row affinity suffices |
+| `prompt_key` | Learned per-task key | ❌ Removed | $A_t$ is the signature directly |
+| `previous_trans_input` | Frozen MLP copies | ❌ Removed | Routing is invariant by construction |
+| KL distillation | Replay-based routing loss | ❌ Removed | No learned routing → no drift |
+| GPM on routing params | Subspace reserved for routing | ❌ Removed | No routing params left to protect |
+| SVD post-training | `prepare_inference_routing()` | ❌ Removed | $A_t$ is the signature (Theorem 2) |
+| - | - | **✅ C5: Data-informed init** | Resolves the GPM–Routing Paradox |
+| - | - | **✅ C4: Gradient precond.** | Fixes the condition number after projection |
+| - | - | **✅ Calibration EMA** | Scale-invariant routing |
+**Net effect**: The entire subspace and compute budget previously spent on routing infrastructure is reclaimed for task learning.
+### 6.2 Training Pipeline
+**Task 1:**
+1. Load pretrained model + fresh LoRA ($A$: Kaiming, $B$: zeros).
+2. Standard training (`lora_B` only): single expert, no routing.
+3. After training: update GPM bases.
+4. Save: LoRA weights + GPM reg files.
+**Task $t \geq 2$:**
+1. Load model + load all previous LoRA weights (frozen).
+2. **[C5]** Pre-task forward (100 batches, no grad):
+   - Collect $C_t = \sum h(x)h(x)^T / N$
+   - Compute $\tilde C_t = Q_{t-1}C_tQ_{t-1}$
+   - $A_t \leftarrow$ top-$r$ eigenvectors of $\tilde C_t$ (fallback: Kaiming if max_eigval < 1e-6)
+3. **[C4]** Precompute preconditioner $(A_tA_t^T + \epsilon I)^{-1/2}$.
+4. Train `lora_B` with oracle routing + gradient preconditioning. The EMA $\hat\mu_t$ is collected automatically.
+5. After training: update GPM bases (200 batches).
+6. Save artifacts.
+### 6.3 Theory → Code Mapping
+| Theory | Implementation | File |
+|-----------|---------------|------|
+| Spectral signature $\mathcal{S}_t = A_t$ | `lora_A` weights (frozen after C5) | `t5_specroute.py` |
+| Affinity $\alpha_t^{\text{cal}}(h)$, hard Top-1 | `compute_spectral_routing()` | `t5_specroute.py` |
+| Oracle routing (training) | `weights[:, 0] = 1.0` | `compute_spectral_routing()` |
+| EMA calibration $\hat\mu_t$ | EMA fit scale, stored in signatures | `compute_spectral_routing()` |
+| Drift-free input $h$ | `embed_tokens(input_ids)` → mean-pool | `T5Stack.forward()` |
+| GPM + InfLoRA null-space | `get_reg_matrix()` | `cl_trainer_specroute.py` |
+| Dynamic ESA threshold | `(1−ε₀)·t/T + ε₀` | `cl_trainer_specroute.py` |
+| C4: Preconditioner | `precompute_preconditioners()` → eigh | `cl_trainer_specroute.py` |
+| C5: Data-informed init | `pre_task_data_collection()` → `eigh(Q@C@Q)` → `lora_A.data` | `cl_trainer_specroute.py` |
+---
+## 7. Experimental Setup
+| Item | Value |
+|----------|---------|
+| **Models** | `google/flan-t5-small` (60M params), `flan-t5-large` (783M) |
+| **Benchmarks** | SuperNI (15 tasks, 4 orderings), Long Sequence (15 tasks, 2 orderings) |
+| **Metrics** | AP: Average Performance ↑, FT: Forgetting ↓ |
+| **LoRA** | $r=8$, target Q+V attention projections |
+| **Routing** | Training: oracle (task weight=1.0); Inference: hard Top-1 calibrated A-row |
+| **C5** | $N_\text{batch}=100$, `torch.linalg.eigh` on the projected covariance |
+| **C4** | Gradient preconditioning enabled (`--use_preconditioning True`), $\epsilon=10^{-6}$ |
+| **ESA threshold** | $\varepsilon_0 = 0.995$ (dynamic) |
+| **GPM repr.** | 200 batches |
+| **Precision** | fp32 + `gradient_checkpointing` |
+| **Batch size** | P100 16GB: BSZ=8, GA=4 (eff. 32); T4: BSZ=2, GA=8 |
+| **Runtime** | SuperNI T5-small ≈ 2–3h; Long ≈ 3–4h (P100) |
+| **Baselines** | GainLoRA (ROOT), InfLoRA, O-LoRA, EWC, L2P, all with the same BSZ/LR/scheduler |
+**Scalability note**: The C5 eigendecomposition is $O(d^2)$ per layer per task.
+- T5-small ($d=512$): acceptable.
+- LLaMA-7B ($d=4096$): use randomized SVD/Lanczos → $O(dr)$.
+---
+## 8. Known Limitations
+**Same-domain routing (structural limit).** For tasks in the same domain (yelp/amazon/imdb, TF-IDF cosine ≈ 0.89), the input $h$ of imdb lies mostly inside yelp's subspace (high PEV). The routing margin → ~0 for such pairs. This is a **fundamental limit of any zero-replay, parameter-free routing**: the Bretagnolle–Huber bound guarantees $P_e \geq \frac{1}{2}(1 - \sqrt{1 - e^{-D_{KL}}})$ → the two tasks are unclassifiable when $D_{KL}(p_\text{imdb} \| p_\text{yelp}) \approx 0$.
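To see how quickly this bound approaches chance level, it can be evaluated for a few divergence values. The function name and the sample $D_{KL}$ values below are illustrative; no divergence was measured for this example.

```python
import math


def bh_error_lower_bound(kl: float) -> float:
    """Bretagnolle-Huber lower bound on routing error: 0.5 * (1 - sqrt(1 - exp(-KL)))."""
    return 0.5 * (1.0 - math.sqrt(1.0 - math.exp(-kl)))


for kl in (0.0, 0.01, 0.1, 1.0, 10.0):
    print(f"D_KL={kl:<5} -> P_e >= {bh_error_lower_bound(kl):.4f}")
```

At $D_{KL} = 0$ the bound equals 0.5, i.e. a two-way router cannot beat a coin flip, and it only becomes negligible once the two input distributions are well separated.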
+**Capacity saturation.** The GPM effective rank is $k_t \approx 30$–$80$ dims/task. Worst case: 15 tasks × 80 = 1200 > 512 = $d$ → null-space collapse in the last tasks. The dynamic threshold (C3) mitigates this but does not fully resolve it.
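The worst-case arithmetic above can be turned into a small check of when the null-space would run out. This sketch assumes the per-task subspaces do not overlap, so their ranks simply add up; the function name is illustrative.

```python
from typing import List, Optional


def first_saturated_task(d: int, ranks: List[int]) -> Optional[int]:
    """Return the 1-based task index at which cumulative rank reaches d, or None if it never does."""
    used = 0
    for t, k in enumerate(ranks, start=1):
        used += k
        if used >= d:
            return t
    return None


print(first_saturated_task(512, [80] * 15))   # upper end of the quoted rank range
print(first_saturated_task(512, [30] * 15))   # lower end of the quoted rank range
```

Under the non-overlap assumption the upper end of the range exhausts $d = 512$ well before task 15, while the lower end (15 × 30 = 450) still fits, so the actual outcome depends on where the effective ranks fall in between.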

improve_gainlora/RUN_GUIDE_DIAGNOSTIC.md ADDED

	@@ -0,0 +1,208 @@

+# SpecRoute Diagnostic Run Guide
+## Quick Start — Priority Experiment
+**Run Long Sequence Order 3 (T5-small) first** — hardest benchmark with many same-domain tasks (yelp/amazon/imdb/sst2 all sentiment).
+### On H100 / A100:
+```bash
+cd /path/to/Continual/improve_gainlora
+# Using T5_small scripts (has --do_predict in all 15 tasks = full diagnostics)
+bash T5_small/gen_script_long_order3_t5_small_specroute.sh 0 google/flan-t5-small
+# OR using top-level scripts (--do_predict only in task 1)
+bash gen_script_long_order3_t5_specroute.sh 0 google/flan-t5-small
+```
+> **Recommendation**: Use `T5_small/` scripts — they have `--do_predict` in all 15 blocks,
+> so routing diagnostics (`routing_decisions.pt`) are saved for every task.
+> Top-level scripts only save routing data for task 1.
+### On Kaggle / Colab (T4 GPU):
+```bash
+bash setup_kaggle_colab.sh   # install deps
+bash T5_small/gen_script_long_order3_t5_small_specroute.sh 0 google/flan-t5-small
+```
+---
+## All Experiments
+| Script | Model | Benchmark | Priority |
+|--------|-------|-----------|----------|
+| `T5_small/gen_script_long_order3_t5_small_specroute.sh` | flan-t5-small | Long Seq (15 tasks, order 3) | **1st** |
+| `T5_small/gen_script_long_order4_t5_small_specroute.sh` | flan-t5-small | Long Seq (15 tasks, order 4) | 2nd |
+| `T5_small/gen_script_superni_order1_t5_small_specroute.sh` | flan-t5-small | SuperNI (15 tasks, order 1) | 3rd |
+| `T5_small/gen_script_superni_order2_t5_small_specroute.sh` | flan-t5-small | SuperNI (15 tasks, order 2) | 4th |
+| `gen_script_superni_order1_llama_specroute.sh` | Llama-2-7B | SuperNI order 1 | Later |
+| `gen_script_superni_order2_llama_specroute.sh` | Llama-2-7B | SuperNI order 2 | Later |
+---
+## CPI/OAP Parameters (already set in all scripts)
+| Parameter | Value | Meaning |
+|-----------|-------|---------|
+| `--cpi_gamma` | 0.5 | CPI contrastive weight (γ in Def. 3) |
+| `--oap_eta` | 0.5 | OAP projection strength (η in Def. 4) |
+| `--oap_beta_min` | 0.3 | OAP minimum retention floor (β_min in Thm. 4) |
+| `--oap_warmup` | 3 | Tasks before full OAP kicks in (empirical safeguard) |
+To change parameters:
+```bash
+python _patch_cpi_oap.py --gamma 0.5 --eta 0.5 --beta_min 0.3 --warmup 3
+# This patches top-level scripts only. For T5_small, edit manually or extend the script.
+```
+---
+## Diagnostic Outputs
+### 1. CPI/OAP Init Diagnostics (during training, task ≥ 2)
+**Log lines** (grep for these in stdout):
+```
+[DIAG-INIT] Layer N chunk K: ρ_l=0.1234, β_l=0.7000, SSE_before=0.45, SSE_after=0.12, λ_min+/r=0.000123, n_pos=6/8
+```
+| Metric | What it tells you | Healthy range |
+|--------|-------------------|---------------|
+| `ρ_l` | Domain proximity to old tasks (weighted) | 0.0–1.0 (higher = more overlap) |
+| `β_l` | OAP retention factor for layer l | ≥ β_min (0.3). Higher ρ → higher β |
+| `SSE_before` | Subspace overlap BEFORE OAP | Varies |
+| `SSE_after` | Subspace overlap AFTER OAP | Should be < SSE_before |
+| `λ_min+/r` | CPI routing margin (Thm. 3 lower bound) | > 1e-5. If ≈ 0, CPI failing |
+| `n_pos/total` | Positive eigenvalues in D_t | ≥ r (lora_r=8). If < r, falling back to Kaiming |
+**Saved file**: `<output_dir>/saved_weights/init_diagnostics.pt`
+- List of dicts, one per layer, with per-chunk diagnostics
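For scanning stdout without loading the `.pt` file, the `[DIAG-INIT]` line can be parsed with a regular expression and checked against the healthy ranges in the table. The pattern below is inferred from the sample log line in this guide, so the real format may differ slightly; the function name and warning strings are illustrative.

```python
import re

NUM = r"[\d.eE+-]+"
INIT_RE = re.compile(
    r"\[DIAG-INIT\] Layer (?P<layer>\d+) chunk (?P<chunk>\d+): "
    rf"ρ_l=(?P<rho>{NUM}), β_l=(?P<beta>{NUM}), "
    rf"SSE_before=(?P<sse_before>{NUM}), SSE_after=(?P<sse_after>{NUM}), "
    rf"λ_min\+/r=(?P<lam>{NUM}), n_pos=(?P<n_pos>\d+)/(?P<total>\d+)"
)


def check_init_line(line, beta_min=0.3, lam_floor=1e-5, lora_r=8):
    """Return a list of warnings for one [DIAG-INIT] line, or None if the line does not match."""
    m = INIT_RE.search(line)
    if m is None:
        return None
    warnings = []
    if float(m["beta"]) < beta_min:
        warnings.append("beta below beta_min")
    if float(m["sse_after"]) >= float(m["sse_before"]):
        warnings.append("SSE did not decrease")
    if float(m["lam"]) <= lam_floor:
        warnings.append("lambda_min+/r too small")
    if int(m["n_pos"]) < lora_r:
        warnings.append("n_pos < r (Kaiming fallback)")
    return warnings


sample = ("[DIAG-INIT] Layer 3 chunk 0: ρ_l=0.1234, β_l=0.7000, SSE_before=0.45, "
          "SSE_after=0.12, λ_min+/r=0.000123, n_pos=6/8")
print(check_init_line(sample))
```

The thresholds default to the values listed in the parameter and metric tables (`β_min = 0.3`, margin floor `1e-5`, `lora_r = 8`).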
+### 2. Routing Diagnostics (during prediction, requires `--do_predict`)
+**Log lines**:
+```
+[DIAG-ROUTING] Task amazon (id=1): routed_to_current=0.850 (850/1000) n_tasks=2
+  task_idx=0: 0.850
+  task_idx=1: 0.150
+```
+| Metric | What it tells you | Healthy range |
+|--------|-------------------|---------------|
+| `routed_to_current` | Fraction correctly routed to current task's expert | > 0.7 |
+| `p_e = 1 - routed_to_current` | Routing error rate (Thm. 4 input) | < 0.3 |
+**Saved file**: `<output_dir>/saved_weights/routing_decisions.pt`
+- Tensor of routing indices (0 = current task, 1+ = old tasks)
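The routing error can also be recovered from the log alone. The following sketch parses the `[DIAG-ROUTING]` header line shown above and computes `p_e` from the hit counts; the regular expression is inferred from that sample line and the function name is illustrative.

```python
import re

ROUTING_RE = re.compile(
    r"\[DIAG-ROUTING\] Task (?P<task>\S+) \(id=(?P<id>\d+)\): "
    r"routed_to_current=(?P<acc>[\d.]+) \((?P<hit>\d+)/(?P<n>\d+)\) n_tasks=(?P<n_tasks>\d+)"
)


def routing_errors(log_text):
    """Map task name -> p_e = 1 - hits / total, for every [DIAG-ROUTING] line in the log."""
    errors = {}
    for m in ROUTING_RE.finditer(log_text):
        errors[m["task"]] = 1.0 - int(m["hit"]) / int(m["n"])
    return errors


sample_log = """\
[DIAG-ROUTING] Task amazon (id=1): routed_to_current=0.850 (850/1000) n_tasks=2
  task_idx=0: 0.850
  task_idx=1: 0.150
"""
print(routing_errors(sample_log))
```

Using the integer counts rather than the rounded `routed_to_current` value avoids losing precision when the number of evaluated samples is large.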
+### 3. Standard Metrics
+After all 15 tasks complete, stdout contains `predict_exact_match_for_<task>` lines.
+Use the scoring script:
+```bash
+# Parse from log file
+python ../parse_and_score_v2.py <logfile>
+```
+---
+## Post-Run Analysis
+### Quick analysis script:
+```bash
+python analyze_diagnostics.py gen_script_long_order3_t5_small_specroute
+```
+This reads all `init_diagnostics.pt` and `routing_decisions.pt` files and prints:
+- Per-task CPI/OAP health (ρ, β, SSE, λ_min+/r, eigenvalue count)
+- Per-task routing error (p_e)
+- Trend analysis (is p_e increasing? Is λ_min+/r collapsing?)
+- Summary table
+### Manual inspection:
+```python
+import torch
+# Load init diagnostics for task 5
+diag = torch.load('logs_and_outputs/gen_script_long_order3_t5_small_specroute/outputs/5-copa/saved_weights/init_diagnostics.pt')
+for layer_idx, layer_data in enumerate(diag):
+    for chunk_idx, d in layer_data.items():
+        print(f"Layer {layer_idx} chunk {chunk_idx}: ρ={d['rho_l']:.4f} β={d['beta_l']:.4f} λ+/r={d['lambda_min_pos_over_r']:.6f}")
+# Load routing decisions for task 10
+rd = torch.load('logs_and_outputs/gen_script_long_order3_t5_small_specroute/outputs/10-dbpedia/saved_weights/routing_decisions.pt')
+p_e = 1.0 - (rd == 0).float().mean().item()
+print(f"p_e = {p_e:.3f}")
+```
+---
+## Ablation Grid (§7.1)
+Run 4 configs to isolate CPI vs OAP contribution:
+| Config | γ | η | What it tests |
+|--------|---|---|---------------|
+| baseline | 0 | 0 | Pure spectral routing (no CPI, no OAP) |
+| CPI only | 0.5 | 0 | CPI init without OAP projection |
+| OAP only | 0 | 0.5 | OAP projection without CPI init |
+| full | 0.5 | 0.5 | Full SpecRoute (default) |
+To run the baseline config:
+```bash
+python _patch_cpi_oap.py --gamma 0 --eta 0
+# Then run the same gen_script with different --run_name to separate outputs
+```
+> Note: When γ=0, CPI falls back to standard Kaiming init.
+> When η=0, OAP is skipped (β_l always = 1, no projection applied).
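The four ablation configurations can be enumerated programmatically so that each patch command is spelled the same way. This sketch only builds the command lines using the `--gamma` and `--eta` flags shown above; it does not run them, and the config keys are illustrative labels.

```python
ABLATION_GRID = {
    "baseline": (0.0, 0.0),   # no CPI, no OAP
    "cpi_only": (0.5, 0.0),
    "oap_only": (0.0, 0.5),
    "full": (0.5, 0.5),       # default SpecRoute setting
}


def patch_command(gamma, eta):
    """Build the _patch_cpi_oap.py invocation for one (gamma, eta) pair."""
    return ["python", "_patch_cpi_oap.py", "--gamma", f"{gamma:g}", "--eta", f"{eta:g}"]


for name, (gamma, eta) in ABLATION_GRID.items():
    print(f"{name:9s}", " ".join(patch_command(gamma, eta)))
```

Each printed command still has to be followed by the corresponding gen_script run with a distinct `--run_name`, as noted above.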
+---
+## What to Report Back
+After the first experiment (Long Order 3 T5-small) finishes:
+1. **Final AP and Forgetting**: run `python ../parse_and_score_v2.py <logfile>`
+2. **Diagnostic summary**: run `python analyze_diagnostics.py gen_script_long_order3_t5_small_specroute`
+3. **Key questions to answer**:
+   - Is `p_e` staying below 0.3 across all 15 tasks?
+   - Is `λ_min+/r` maintaining a healthy margin (> 1e-5)?
+   - Does SSE decrease after OAP (SSE_after < SSE_before)?
+   - Is β_l respecting the β_min=0.3 floor?
+   - Any tasks with routing accuracy < 0.5? (indicates method failure)
+4. **If things look bad** (p_e > 0.3 and rising):
+   → We discuss Option 1: decoupled routing, prototype-based routing, or hierarchical routing
+5. **If things look good** (p_e < 0.3, stable):
+   → Proceed to remaining experiments (Order 4, SuperNI, Llama)
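The good/bad decision in items 4 and 5 can be written down as a small rule over the per-task `p_e` values. How "rising" is defined is an assumption made here (latest value above the first one); the function name and return labels are illustrative.

```python
def routing_verdict(p_e_per_task, limit=0.3):
    """Classify a run from its per-task routing errors, listed in training order."""
    latest = p_e_per_task[-1]
    rising = len(p_e_per_task) >= 2 and latest > p_e_per_task[0]
    if latest > limit and rising:
        return "bad"            # p_e > 0.3 and rising: revisit the routing design
    if all(p < limit for p in p_e_per_task):
        return "good"           # p_e < 0.3 throughout: proceed to remaining experiments
    return "inconclusive"       # mixed picture: inspect per-task diagnostics


print(routing_verdict([0.10, 0.20, 0.40]))
print(routing_verdict([0.10, 0.15, 0.12]))
```

The per-task values can come from `routing_decisions.pt` or from the `[DIAG-ROUTING]` log lines.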
+---
+## Output Directory Structure
+```
+logs_and_outputs/gen_script_long_order3_t5_small_specroute/
+  outputs/
+    task_order.txt
+    1-yelp/
+      saved_weights/
+        routing_decisions.pt      # routing stats (if --do_predict)
+        init_diagnostics.pt       # CPI/OAP diagnostics (task >= 2)
+        spectral_signatures.pt    # frozen A matrices + calibration
+        ...
+    2-amazon/
+      saved_weights/
+        init_diagnostics.pt
+        routing_decisions.pt
+        ...
+    ...
+    15-wic/
+      saved_weights/
+        ...
+```

improve_gainlora/T5_small/gen_script_long_order3_t5_small_specroute.sh CHANGED

@@ -114,6 +114,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -170,6 +174,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -226,6 +234,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -282,6 +294,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -338,6 +354,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -394,6 +414,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -450,6 +474,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -506,6 +534,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -562,6 +594,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -618,6 +654,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -674,6 +714,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -730,6 +774,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -786,6 +834,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -842,6 +894,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -898,6 +954,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG

    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG

improve_gainlora/T5_small/gen_script_long_order3_t5_small_specroute_v10a.sh CHANGED

@@ -114,6 +114,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -173,6 +177,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -232,6 +240,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -291,6 +303,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -350,6 +366,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -409,6 +429,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -468,6 +492,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -527,6 +555,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -586,6 +618,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -645,6 +681,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -704,6 +744,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -763,6 +807,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -822,6 +870,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -881,6 +933,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -940,6 +996,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \

    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \

improve_gainlora/T5_small/gen_script_long_order3_t5_small_specroute_v10b.sh CHANGED

@@ -114,6 +114,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -171,6 +175,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -228,6 +236,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -285,6 +297,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -342,6 +358,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -399,6 +419,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -456,6 +480,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -513,6 +541,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -570,6 +602,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -627,6 +663,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -684,6 +724,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -741,6 +785,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -798,6 +846,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -855,6 +907,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -912,6 +968,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \

    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode grassmann \
    --threshold 0.995 \
    --transthreshold 0.995 \

improve_gainlora/T5_small/gen_script_long_order3_t5_small_specroute_v11.sh CHANGED

@@ -114,6 +114,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -173,6 +177,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -232,6 +240,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -291,6 +303,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -350,6 +366,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -409,6 +429,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -468,6 +492,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -527,6 +555,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -586,6 +618,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -645,6 +681,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -704,6 +744,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -763,6 +807,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -822,6 +870,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -881,6 +933,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
@@ -940,6 +996,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \

    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --routing_mode learned \
    --threshold 0.995 \
    --transthreshold 0.995 \

improve_gainlora/T5_small/gen_script_long_order4_t5_small_specroute.sh CHANGED

@@ -114,6 +114,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -170,6 +174,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -226,6 +234,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -282,6 +294,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -338,6 +354,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -394,6 +414,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -450,6 +474,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -506,6 +534,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -562,6 +594,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -618,6 +654,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -674,6 +714,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -730,6 +774,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -786,6 +834,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -842,6 +894,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -898,6 +954,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG

    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG

improve_gainlora/T5_small/gen_script_superni_order1_t5_small_specroute.sh CHANGED

@@ -125,6 +125,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -178,6 +182,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -231,6 +239,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -284,6 +296,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -337,6 +353,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -390,6 +410,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -443,6 +467,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -496,6 +524,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -549,6 +581,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -602,6 +638,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -655,6 +695,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -708,6 +752,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -761,6 +809,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -814,6 +866,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -867,6 +923,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG

    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG

improve_gainlora/T5_small/gen_script_superni_order2_t5_small_specroute.sh CHANGED

@@ -114,6 +114,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -166,6 +170,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -218,6 +226,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -270,6 +282,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -322,6 +338,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -374,6 +394,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -426,6 +450,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -478,6 +506,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -530,6 +562,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -582,6 +618,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -634,6 +674,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -686,6 +730,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -738,6 +786,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -790,6 +842,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -842,6 +898,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG

    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG

improve_gainlora/_patch_cpi_oap.py ADDED

	@@ -0,0 +1,72 @@

+"""Patch all specroute gen_scripts to add CPI+OAP parameters.
+Usage: python _patch_cpi_oap.py [--gamma 0.5] [--eta 0.5] [--beta_min 0.3] [--warmup 3]
+"""
+import re, os, sys
+BASE = os.path.dirname(os.path.abspath(__file__))
+# Default values
+GAMMA = 0.5
+ETA = 0.5
+BETA_MIN = 0.3
+WARMUP = 3
+# Parse CLI overrides
+args = sys.argv[1:]
+i = 0
+while i < len(args):
+    if args[i] == '--gamma' and i + 1 < len(args):
+        GAMMA = float(args[i+1]); i += 2
+    elif args[i] == '--eta' and i + 1 < len(args):
+        ETA = float(args[i+1]); i += 2
+    elif args[i] == '--beta_min' and i + 1 < len(args):
+        BETA_MIN = float(args[i+1]); i += 2
+    elif args[i] == '--warmup' and i + 1 < len(args):
+        WARMUP = int(args[i+1]); i += 2
+    else:
+        i += 1
+CPI_OAP_BLOCK = (
+    f'   --cpi_gamma {GAMMA} \\\n'
+    f'   --oap_eta {ETA} \\\n'
+    f'   --oap_beta_min {BETA_MIN} \\\n'
+    f'   --oap_warmup {WARMUP} \\\n'
+)
+# Find all specroute gen scripts
+scripts = sorted([
+    os.path.join(BASE, f) for f in os.listdir(BASE)
+    if f.startswith('gen_script_') and 'specroute' in f and f.endswith('.sh')
+])
+ANCHOR = re.compile(r'(   --model_name specroute \\\n)')
+for script_path in scripts:
+    with open(script_path) as f:
+        content = f.read()
+    if '--cpi_gamma' in content:
+        # Already patched — update values
+        content = re.sub(r'--cpi_gamma [\d.]+', f'--cpi_gamma {GAMMA}', content)
+        content = re.sub(r'--oap_eta [\d.]+', f'--oap_eta {ETA}', content)
+        content = re.sub(r'--oap_beta_min [\d.]+', f'--oap_beta_min {BETA_MIN}', content)
+        content = re.sub(r'--oap_warmup \d+', f'--oap_warmup {WARMUP}', content)
+        action = 'UPDATED'
+    else:
+        # Insert CPI+OAP params after --model_name specroute
+        matches = list(ANCHOR.finditer(content))
+        if not matches:
+            print(f'SKIP (no anchor): {os.path.basename(script_path)}')
+            continue
+        # Insert after each occurrence (multiple task blocks)
+        for m in reversed(matches):
+            insert_pos = m.end()
+            content = content[:insert_pos] + CPI_OAP_BLOCK + content[insert_pos:]
+        action = 'PATCHED'
+    with open(script_path, 'w') as f:
+        f.write(content)
+    n_blocks = len(ANCHOR.findall(content)) if action == 'UPDATED' else len(matches)
+    print(f'{action} ({n_blocks} blocks): {os.path.basename(script_path)} '
+          f'[gamma={GAMMA}, eta={ETA}, beta_min={BETA_MIN}, warmup={WARMUP}]')
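The patch logic above can be exercised on an in-memory string before touching any script on disk. The following is a minimal sketch, not part of the commit: `patch_text` is a hypothetical helper that mirrors the insert-or-update branches of `_patch_cpi_oap.py`, and the three-line `sample` is made up to resemble one task block.

```python
import re

# Same anchor pattern as _patch_cpi_oap.py: the "--model_name specroute \" line.
ANCHOR = re.compile(r'(   --model_name specroute \\\n)')


def patch_text(content, gamma=0.5, eta=0.5, beta_min=0.3, warmup=3):
    """Hypothetical helper: return (new_content, action) without file I/O."""
    block = (
        f'   --cpi_gamma {gamma} \\\n'
        f'   --oap_eta {eta} \\\n'
        f'   --oap_beta_min {beta_min} \\\n'
        f'   --oap_warmup {warmup} \\\n'
    )
    if '--cpi_gamma' in content:
        # Already patched: rewrite the existing values in place.
        content = re.sub(r'--cpi_gamma [\d.]+', f'--cpi_gamma {gamma}', content)
        content = re.sub(r'--oap_eta [\d.]+', f'--oap_eta {eta}', content)
        content = re.sub(r'--oap_beta_min [\d.]+', f'--oap_beta_min {beta_min}', content)
        content = re.sub(r'--oap_warmup \d+', f'--oap_warmup {warmup}', content)
        return content, 'UPDATED'
    matches = list(ANCHOR.finditer(content))
    if not matches:
        return content, 'SKIP'
    # Insert from the last match backwards so earlier offsets stay valid.
    for m in reversed(matches):
        content = content[:m.end()] + block + content[m.end():]
    return content, 'PATCHED'


# Made-up fragment resembling one task block of a gen_script.
sample = (
    '   --mlp_hidden_dim 100 \\\n'
    '   --model_name specroute \\\n'
    '   --threshold 0.995 \\\n'
)
patched, action = patch_text(sample)
repatched, action2 = patch_text(patched, gamma=0.7)
print(action, action2)
print(repatched)
```

Running the helper twice shows the idempotence the script relies on: the first call inserts the four flags after the anchor, and the second call only rewrites their values.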

improve_gainlora/analyze_diagnostics.py ADDED

	@@ -0,0 +1,205 @@

+#!/usr/bin/env python3
+"""Post-hoc diagnostic analysis for SpecRoute experiments.
+Reads init_diagnostics.pt and routing_decisions.pt from each task output
+and prints a comprehensive report.
+Usage:
+    python analyze_diagnostics.py <run_name>
+    python analyze_diagnostics.py gen_script_long_order3_t5_specroute
+Output:
+    Per-task CPI/OAP diagnostics (ρ_l, β_l, SSE, λ_min+, eigenvalue health)
+    Per-task routing accuracy (p_e estimate from routing decisions)
+"""
+import os, sys, torch, json
+import numpy as np
+def load_diag(task_dir):
+    """Load diagnostics from a task output directory."""
+    diag_path = os.path.join(task_dir, 'init_diagnostics.pt')
+    routing_path = os.path.join(task_dir, 'saved_weights', 'routing_decisions.pt')
+    init_diag = torch.load(diag_path, map_location='cpu') if os.path.exists(diag_path) else None
+    routing = torch.load(routing_path, map_location='cpu') if os.path.exists(routing_path) else None
+    return init_diag, routing
+def analyze_init_diag(init_diag, task_name, task_idx):
+    """Analyze CPI/OAP initialization diagnostics."""
+    if init_diag is None:
+        print(f'  [INIT] No init diagnostics (task 1 or missing file)')
+        return {}
+    all_rho, all_beta, all_sse_b, all_sse_a = [], [], [], []
+    all_lmin, all_npos, all_ntotal = [], [], []
+    for layer_idx, layer_diag in enumerate(init_diag):
+        for chunk_idx, d in layer_diag.items():
+            all_rho.append(d['rho_l'])
+            all_beta.append(d['beta_l'])
+            all_sse_b.append(d['sse_before'])
+            all_sse_a.append(d['sse_after'])
+            all_lmin.append(d['lambda_min_pos_over_r'])
+            all_npos.append(d['n_pos_eigvals'])
+            all_ntotal.append(d['n_total_eigvals'])
+    summary = {
+        'rho_l_mean': np.mean(all_rho),
+        'rho_l_max': np.max(all_rho),
+        'beta_l_mean': np.mean(all_beta),
+        'beta_l_min': np.min(all_beta),
+        'sse_before_mean': np.mean(all_sse_b),
+        'sse_after_mean': np.mean(all_sse_a),
+        'lambda_min_pos_over_r_mean': np.mean(all_lmin),
+        'lambda_min_pos_over_r_min': np.min(all_lmin),
+        'n_pos_mean': np.mean(all_npos),
+        'n_total': all_ntotal[0] if all_ntotal else 0,
+        'n_layers': len(init_diag),
+    }
+    print(f'  [INIT] ρ_l: mean={summary["rho_l_mean"]:.4f} max={summary["rho_l_max"]:.4f}')
+    print(f'         β_l: mean={summary["beta_l_mean"]:.4f} min={summary["beta_l_min"]:.4f}')
+    print(f'         SSE: {summary["sse_before_mean"]:.4f} → {summary["sse_after_mean"]:.4f} (OAP reduction)')
+    print(f'         λ_min+/r: mean={summary["lambda_min_pos_over_r_mean"]:.6f} min={summary["lambda_min_pos_over_r_min"]:.6f} (Thm3 routing margin)')
+    print(f'         Eigenvalues: {summary["n_pos_mean"]:.1f}/{summary["n_total"]} positive (avg across layers)')
+    # Health checks
+    if summary['lambda_min_pos_over_r_min'] < 1e-5:
+        print(f'  ⚠  WARNING: λ_min+/r very small → CPI routing margin weak for some layers')
+    if summary['n_pos_mean'] < 8:
+        print(f'  ⚠  WARNING: Few positive eigenvalues → CPI falling back to Kaiming on some layers')
+    return summary
+def analyze_routing(routing, task_name, task_idx, n_old_tasks):
+    """Analyze routing decisions for p_e estimation."""
+    if routing is None:
+        print(f'  [ROUTE] No routing data (first task or missing file)')
+        return {}
+    n_total = len(routing)
+    # In SpecRoute, the current task's expert is always index 0 at prediction time.
+    # Evaluation here runs on the CURRENT task's test set, so correct routing
+    # means index 0 (the current expert); any other index counts as a routing error.
+    routed_to_current = (routing == 0).float().mean().item()
+    p_e = 1.0 - routed_to_current
+    summary = {
+        'n_samples': n_total,
+        'routed_to_current': routed_to_current,
+        'p_e': p_e,
+        'n_tasks_available': n_old_tasks + 1,
+    }
+    print(f'  [ROUTE] Routed to current: {routed_to_current:.3f} ({int((routing==0).sum())}/{n_total})')
+    print(f'          p_e (routing error): {p_e:.3f} (n_tasks={n_old_tasks+1})')
+    # Distribution
+    for t in range(n_old_tasks + 1):
+        frac = (routing == t).float().mean().item()
+        if frac > 0.001:
+            label = 'CURRENT' if t == 0 else f'old_{t}'
+            print(f'          task_idx={t} ({label}): {frac:.3f}')
+    # Health checks
+    if p_e > 0.3:
+        print(f'  ⚠  WARNING: High routing error (p_e={p_e:.3f}) — forgetting risk')
+    if p_e > 0.5:
+        print(f'  🚨 CRITICAL: Routing worse than random for 2-class! Method likely failing.')
+    return summary
+def main():
+    if len(sys.argv) < 2:
+        print('Usage: python analyze_diagnostics.py <run_name>')
+        print('  e.g.: python analyze_diagnostics.py gen_script_long_order3_t5_specroute')
+        sys.exit(1)
+    run_name = sys.argv[1]
+    base_dir = os.path.join('logs_and_outputs', run_name, 'outputs')
+    if not os.path.isdir(base_dir):
+        print(f'ERROR: {base_dir} not found')
+        sys.exit(1)
+    # Discover task directories (sorted by task index)
+    task_dirs = sorted([
+        d for d in os.listdir(base_dir)
+        if os.path.isdir(os.path.join(base_dir, d)) and d[0].isdigit()
+    ], key=lambda x: int(x.split('-')[0]))
+    print('=' * 70)
+    print(f'SpecRoute Diagnostic Report: {run_name}')
+    print(f'Tasks found: {len(task_dirs)}')
+    print('=' * 70)
+    all_summaries = []
+    for task_dir_name in task_dirs:
+        task_dir = os.path.join(base_dir, task_dir_name)
+        task_idx = int(task_dir_name.split('-')[0])
+        task_name = '-'.join(task_dir_name.split('-')[1:])
+        print(f'\n--- Task {task_idx}: {task_name} ---')
+        # Try saved_weights subdir first, then task_dir itself
+        sw_dir = os.path.join(task_dir, 'saved_weights')
+        diag_dir = sw_dir if os.path.isdir(sw_dir) else task_dir
+        init_diag_path = os.path.join(diag_dir, 'init_diagnostics.pt')
+        init_diag = torch.load(init_diag_path, map_location='cpu') if os.path.exists(init_diag_path) else None
+        routing_path = os.path.join(diag_dir, 'routing_decisions.pt')
+        routing = torch.load(routing_path, map_location='cpu') if os.path.exists(routing_path) else None
+        init_summary = analyze_init_diag(init_diag, task_name, task_idx)
+        route_summary = analyze_routing(routing, task_name, task_idx, task_idx - 1)
+        all_summaries.append({
+            'task_idx': task_idx, 'task_name': task_name,
+            'init': init_summary, 'routing': route_summary,
+        })
+    # Global summary table
+    print('\n' + '=' * 70)
+    print('SUMMARY TABLE')
+    print('=' * 70)
+    print(f'{"Task":<20} {"ρ_l":>6} {"β_l":>6} {"SSE→":>10} {"λ+/r":>8} {"p_e":>6} {"n_pos":>6}')
+    print('-' * 70)
+    for s in all_summaries:
+        i = s['init']
+        r = s['routing']
+        rho = f'{i.get("rho_l_mean", 0):.3f}' if i else '-'
+        beta = f'{i.get("beta_l_mean", 0):.3f}' if i else '-'
+        sse = f'{i.get("sse_before_mean", 0):.2f}→{i.get("sse_after_mean", 0):.2f}' if i else '-'
+        lmin = f'{i.get("lambda_min_pos_over_r_mean", 0):.5f}' if i else '-'
+        pe = f'{r.get("p_e", 0):.3f}' if r else '-'
+        npos = f'{i.get("n_pos_mean", 0):.0f}' if i else '-'
+        print(f'{s["task_name"]:<20} {rho:>6} {beta:>6} {sse:>10} {lmin:>8} {pe:>6} {npos:>6}')
+    # Trend analysis
+    p_e_values = [s['routing'].get('p_e', None) for s in all_summaries if s['routing']]
+    if len(p_e_values) >= 3:
+        first_half = np.mean(p_e_values[:len(p_e_values)//2])
+        second_half = np.mean(p_e_values[len(p_e_values)//2:])
+        print(f'\n[TREND] p_e first half: {first_half:.3f}, second half: {second_half:.3f}')
+        if second_half > first_half * 1.5:
+            print('  ⚠  p_e increasing significantly — routing degrades with more tasks')
+        else:
+            print('  ✓  p_e stable across tasks')
+    lmin_values = [s['init'].get('lambda_min_pos_over_r_min', None) for s in all_summaries if s['init']]
+    if len(lmin_values) >= 3:
+        first_half = np.mean(lmin_values[:len(lmin_values)//2])
+        second_half = np.mean(lmin_values[len(lmin_values)//2:])
+        print(f'[TREND] λ_min+/r first half: {first_half:.6f}, second half: {second_half:.6f}')
+        if second_half < first_half * 0.1:
+            print('  ⚠  CPI margin collapsing — may need stronger γ or new routing approach')
+    print('\n' + '=' * 70)
+if __name__ == '__main__':
+    main()
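The routing-error estimate used in `analyze_routing` is just the fraction of evaluation samples not routed to index 0. Below is a dependency-free sketch of that arithmetic, assuming a plain Python list in place of the `routing_decisions.pt` tensor; `routing_error` is a hypothetical helper and the `decisions` values are made up for illustration.

```python
from collections import Counter


def routing_error(routing, current_idx=0):
    """Hypothetical helper: return (p_e, distribution) for a list of expert indices.

    p_e is the fraction of samples NOT routed to `current_idx`, matching
    p_e = 1 - mean(routing == 0) in analyze_routing.
    """
    n_total = len(routing)
    if n_total == 0:
        return None, {}
    counts = Counter(routing)
    p_e = 1.0 - counts[current_idx] / n_total
    dist = {t: c / n_total for t, c in sorted(counts.items())}
    return p_e, dist


# Made-up routing decisions for 8 samples: 6 go to the current expert (index 0).
decisions = [0, 0, 0, 1, 0, 2, 0, 0]
p_e, dist = routing_error(decisions)
print(f'p_e = {p_e:.3f}')
for t, frac in dist.items():
    label = 'CURRENT' if t == 0 else f'old_{t}'
    print(f'task_idx={t} ({label}): {frac:.3f}')
```

With these made-up decisions the estimate stays below the 0.3 warning threshold that the report checks.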

improve_gainlora/gen_script_long_order3_t5_specroute.sh CHANGED

@@ -110,6 +110,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -167,6 +171,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -224,6 +232,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -281,6 +293,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -338,6 +354,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -395,6 +415,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -452,6 +476,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -509,6 +537,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -566,6 +598,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -623,6 +659,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -680,6 +720,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -737,6 +781,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -794,6 +842,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -851,6 +903,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -908,6 +964,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG

improve_gainlora/gen_script_long_order4_t5_specroute.sh CHANGED

@@ -110,6 +110,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -167,6 +171,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -224,6 +232,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -281,6 +293,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -338,6 +354,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -395,6 +415,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -452,6 +476,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -509,6 +537,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -566,6 +598,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -623,6 +659,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -680,6 +720,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -737,6 +781,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -794,6 +842,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -851,6 +903,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -908,6 +964,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG

improve_gainlora/gen_script_superni_order1_llama_specroute.sh CHANGED

@@ -74,6 +74,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/1-task1572_samsum_summary/checkpoint*
@@ -124,6 +128,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/2-task363_sst2_polarity_classification/checkpoint*
@@ -174,6 +182,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/3-task1290_xsum_summarization/checkpoint*
@@ -224,6 +236,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/4-task181_outcome_extraction/checkpoint*
@@ -274,6 +290,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/5-task002_quoref_answer_generation/checkpoint*
@@ -324,6 +344,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/6-task1510_evalution_relation_extraction/checkpoint*
@@ -374,6 +398,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/7-task639_multi_woz_user_utterance_generation/checkpoint*
@@ -424,6 +452,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/8-task1729_personachat_generate_next/checkpoint*
@@ -474,6 +506,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/9-task073_commonsenseqa_answer_generation/checkpoint*
@@ -524,6 +560,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/10-task1590_diplomacy_text_generation/checkpoint*
@@ -574,6 +614,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/11-task748_glucose_reverse_cause_event_detection/checkpoint*
@@ -624,6 +668,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/12-task511_reddit_tifu_long_text_summarization/checkpoint*
@@ -674,6 +722,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/13-task591_sciq_answer_generation/checkpoint*
@@ -724,6 +776,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/14-task1687_sentiment140_classification/checkpoint*
@@ -774,6 +830,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/15-task875_emotion_classification/checkpoint*

    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/1-task1572_samsum_summary/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/2-task363_sst2_polarity_classification/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/3-task1290_xsum_summarization/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/4-task181_outcome_extraction/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/5-task002_quoref_answer_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/6-task1510_evalution_relation_extraction/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/7-task639_multi_woz_user_utterance_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/8-task1729_personachat_generate_next/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/9-task073_commonsenseqa_answer_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/10-task1590_diplomacy_text_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/11-task748_glucose_reverse_cause_event_detection/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/12-task511_reddit_tifu_long_text_summarization/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/13-task591_sciq_answer_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/14-task1687_sentiment140_classification/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/15-task875_emotion_classification/checkpoint*
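
Every hunk in this file makes the same change: the four flags `--cpi_gamma 0.5`, `--oap_eta 0.5`, `--oap_beta_min 0.3` and `--oap_warmup 3` are inserted immediately before `--threshold 0.995`. The sketch below shows one way such a repeated edit could be scripted. The function name `insert_flags` and its logic are assumptions made for illustration; the commit's own `_patch_cpi_oap.py` is not shown in this part of the diff and may work differently.

```python
import re

# Flag names and values are copied from the diff; the patching logic is an assumption.
NEW_FLAGS = [
    ("--cpi_gamma", "0.5"),
    ("--oap_eta", "0.5"),
    ("--oap_beta_min", "0.3"),
    ("--oap_warmup", "3"),
]

# Matches a line whose first token is exactly --threshold (not --transthreshold).
THRESHOLD_RE = re.compile(r"^(\s*)--threshold\b")


def insert_flags(script_text: str) -> str:
    """Insert NEW_FLAGS before every `--threshold` line that does not already have them."""
    out = []
    for line in script_text.splitlines(keepends=True):
        match = THRESHOLD_RE.match(line)
        # If the previously emitted line is already the last new flag,
        # the block was patched before, so nothing is inserted again.
        already_patched = bool(out) and out[-1].strip().startswith("--oap_warmup")
        if match and not already_patched:
            indent = match.group(1)
            for name, value in NEW_FLAGS:
                out.append(f"{indent}{name} {value} \\\n")
        out.append(line)
    return "".join(out)


sample = (
    "   --model_name specroute \\\n"
    "   --threshold 0.995\n"
    "rm -rf outputs/checkpoint*\n"
)
patched = insert_flags(sample)
print(patched)
```

Because the function checks the preceding line before inserting, running it a second time on already patched text should leave the text unchanged.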

improve_gainlora/gen_script_superni_order1_llama_specroute_p100.sh CHANGED

@@ -51,6 +51,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/1-task1572_samsum_summary/checkpoint*
@@ -101,6 +105,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/2-task363_sst2_polarity_classification/checkpoint*
@@ -151,6 +159,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/3-task1290_xsum_summarization/checkpoint*
@@ -201,6 +213,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/4-task181_outcome_extraction/checkpoint*
@@ -251,6 +267,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/5-task002_quoref_answer_generation/checkpoint*
@@ -301,6 +321,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/6-task1510_evalution_relation_extraction/checkpoint*
@@ -351,6 +375,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/7-task639_multi_woz_user_utterance_generation/checkpoint*
@@ -401,6 +429,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/8-task1729_personachat_generate_next/checkpoint*
@@ -451,6 +483,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/9-task073_commonsenseqa_answer_generation/checkpoint*
@@ -501,6 +537,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/10-task1590_diplomacy_text_generation/checkpoint*
@@ -551,6 +591,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/11-task748_glucose_reverse_cause_event_detection/checkpoint*
@@ -601,6 +645,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/12-task511_reddit_tifu_long_text_summarization/checkpoint*
@@ -651,6 +699,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/13-task591_sciq_answer_generation/checkpoint*
@@ -701,6 +753,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/14-task1687_sentiment140_classification/checkpoint*
@@ -751,6 +807,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/15-task875_emotion_classification/checkpoint*

    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/1-task1572_samsum_summary/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/2-task363_sst2_polarity_classification/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/3-task1290_xsum_summarization/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/4-task181_outcome_extraction/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/5-task002_quoref_answer_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/6-task1510_evalution_relation_extraction/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/7-task639_multi_woz_user_utterance_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/8-task1729_personachat_generate_next/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/9-task073_commonsenseqa_answer_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/10-task1590_diplomacy_text_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/11-task748_glucose_reverse_cause_event_detection/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/12-task511_reddit_tifu_long_text_summarization/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/13-task591_sciq_answer_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/14-task1687_sentiment140_classification/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/15-task875_emotion_classification/checkpoint*

improve_gainlora/gen_script_superni_order1_t5_specroute.sh CHANGED

@@ -114,6 +114,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -169,6 +173,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -224,6 +232,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -279,6 +291,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -334,6 +350,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -389,6 +409,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -444,6 +468,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -499,6 +527,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -554,6 +586,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -609,6 +645,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -664,6 +704,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -719,6 +763,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -774,6 +822,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -829,6 +881,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -884,6 +940,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG

    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
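
With fifteen near-identical hunks per script, a missed invocation is easy to overlook. One possible check is to compare the number of `--model_name specroute` occurrences with the number of occurrences of each new flag. The sketch below is a hedged illustration only; the helper `missing_flags` is hypothetical and not part of this commit.

```python
# Flag names are taken from the diff; the checking logic is an assumption.
REQUIRED_FLAGS = ("--cpi_gamma", "--oap_eta", "--oap_beta_min", "--oap_warmup")


def missing_flags(script_text: str) -> dict:
    """Return {flag: number of specroute invocations that lack it}."""
    tokens = script_text.split()
    # Count invocations by the token pair `--model_name specroute`.
    invocations = sum(
        1
        for i, token in enumerate(tokens[:-1])
        if token == "--model_name" and tokens[i + 1] == "specroute"
    )
    counts = {flag: tokens.count(flag) for flag in REQUIRED_FLAGS}
    # Only report flags whose count differs from the number of invocations.
    return {flag: invocations - n for flag, n in counts.items() if n != invocations}


sample = """\
python src/run_t5.py \\
   --model_name specroute \\
   --cpi_gamma 0.5 \\
   --oap_eta 0.5 \\
   --oap_beta_min 0.3 \\
   --oap_warmup 3 \\
   --threshold 0.995 \\
   --transthreshold 0.995
python src/run_t5.py \\
   --model_name specroute \\
   --threshold 0.995 \\
   --transthreshold 0.995
"""
report = missing_flags(sample)
print(report)
```

An empty dictionary means every counted invocation carries all four flags. A count-based check cannot tell which invocation is missing a flag, only how many are.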

improve_gainlora/gen_script_superni_order2_llama_specroute.sh CHANGED

@@ -74,6 +74,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/1-task748_glucose_reverse_cause_event_detection/checkpoint*
@@ -124,6 +128,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/2-task073_commonsenseqa_answer_generation/checkpoint*
@@ -174,6 +182,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/3-task1590_diplomacy_text_generation/checkpoint*
@@ -224,6 +236,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/4-task639_multi_woz_user_utterance_generation/checkpoint*
@@ -274,6 +290,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/5-task1572_samsum_summary/checkpoint*
@@ -324,6 +344,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/6-task1687_sentiment140_classification/checkpoint*
@@ -374,6 +398,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/7-task591_sciq_answer_generation/checkpoint*
@@ -424,6 +452,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/8-task363_sst2_polarity_classification/checkpoint*
@@ -474,6 +506,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/9-task1510_evalution_relation_extraction/checkpoint*
@@ -524,6 +560,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/10-task1729_personachat_generate_next/checkpoint*
@@ -574,6 +614,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/11-task181_outcome_extraction/checkpoint*
@@ -624,6 +668,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/12-task511_reddit_tifu_long_text_summarization/checkpoint*
@@ -674,6 +722,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/13-task002_quoref_answer_generation/checkpoint*
@@ -724,6 +776,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/14-task1290_xsum_summarization/checkpoint*
@@ -774,6 +830,10 @@ deepspeed --include $DS_INCLUDE --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/15-task875_emotion_classification/checkpoint*

    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/1-task748_glucose_reverse_cause_event_detection/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/2-task073_commonsenseqa_answer_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/3-task1590_diplomacy_text_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/4-task639_multi_woz_user_utterance_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/5-task1572_samsum_summary/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/6-task1687_sentiment140_classification/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/7-task591_sciq_answer_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/8-task363_sst2_polarity_classification/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/9-task1510_evalution_relation_extraction/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/10-task1729_personachat_generate_next/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/11-task181_outcome_extraction/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/12-task511_reddit_tifu_long_text_summarization/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/13-task002_quoref_answer_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/14-task1290_xsum_summarization/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/15-task875_emotion_classification/checkpoint*

improve_gainlora/gen_script_superni_order2_llama_specroute_p100.sh CHANGED

@@ -51,6 +51,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/1-task748_glucose_reverse_cause_event_detection/checkpoint*
@@ -101,6 +105,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/2-task073_commonsenseqa_answer_generation/checkpoint*
@@ -151,6 +159,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/3-task1590_diplomacy_text_generation/checkpoint*
@@ -201,6 +213,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/4-task639_multi_woz_user_utterance_generation/checkpoint*
@@ -251,6 +267,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/5-task1572_samsum_summary/checkpoint*
@@ -301,6 +321,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/6-task1687_sentiment140_classification/checkpoint*
@@ -351,6 +375,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/7-task591_sciq_answer_generation/checkpoint*
@@ -401,6 +429,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/8-task363_sst2_polarity_classification/checkpoint*
@@ -451,6 +483,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/9-task1510_evalution_relation_extraction/checkpoint*
@@ -501,6 +537,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/10-task1729_personachat_generate_next/checkpoint*
@@ -551,6 +591,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/11-task181_outcome_extraction/checkpoint*
@@ -601,6 +645,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/12-task511_reddit_tifu_long_text_summarization/checkpoint*
@@ -651,6 +699,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/13-task002_quoref_answer_generation/checkpoint*
@@ -701,6 +753,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/14-task1290_xsum_summarization/checkpoint*
@@ -751,6 +807,10 @@ deepspeed --num_gpus 1 --master_port 49500 src/run_llama.py \
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/15-task875_emotion_classification/checkpoint*

    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/1-task748_glucose_reverse_cause_event_detection/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/2-task073_commonsenseqa_answer_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/3-task1590_diplomacy_text_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/4-task639_multi_woz_user_utterance_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/5-task1572_samsum_summary/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/6-task1687_sentiment140_classification/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/7-task591_sciq_answer_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/8-task363_sst2_polarity_classification/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/9-task1510_evalution_relation_extraction/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/10-task1729_personachat_generate_next/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/11-task181_outcome_extraction/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/12-task511_reddit_tifu_long_text_summarization/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/13-task002_quoref_answer_generation/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/14-task1290_xsum_summarization/checkpoint*
    --data_replay_freq -1 \
    --chunk 4 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995
 rm -rf logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/15-task875_emotion_classification/checkpoint*

improve_gainlora/gen_script_superni_order2_t5_specroute.sh CHANGED

@@ -111,6 +111,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -165,6 +169,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -219,6 +227,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -273,6 +285,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -327,6 +343,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -381,6 +401,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -435,6 +459,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -489,6 +517,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -543,6 +575,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -597,6 +633,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -651,6 +691,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -705,6 +749,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -759,6 +807,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -813,6 +865,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
@@ -867,6 +923,10 @@ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG

    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
    --data_replay_freq -1 \
    --mlp_hidden_dim 100 \
    --model_name specroute \
+   --cpi_gamma 0.5 \
+   --oap_eta 0.5 \
+   --oap_beta_min 0.3 \
+   --oap_warmup 3 \
    --threshold 0.995 \
    --transthreshold 0.995 \
    $FP16_FLAG
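
The four flags added to every invocation above (`--cpi_gamma`, `--oap_eta`, `--oap_beta_min`, `--oap_warmup`) match the new keyword arguments of `SpecRoute_Trainer` in `cl_trainer_specroute.py`. How `src/run_t5.py` declares them is not visible in this part of the diff, so the following is only a minimal sketch, assuming plain `argparse`. The `build_parser` helper and the help strings are hypothetical; the defaults are copied from the trainer constructor, and the example values are the ones the scripts pass.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for the CPI/OAP flags; not the repository's actual code."""
    parser = argparse.ArgumentParser(description="SpecRoute CPI/OAP flags (sketch)")
    # CPI: Contrastive Projected Initialization. 0.0 is the trainer default.
    parser.add_argument("--cpi_gamma", type=float, default=0.0,
                        help="CPI strength (assumed meaning); 0 disables CPI")
    # OAP: Overlap-Aware Projection. 0.0 is the trainer default.
    parser.add_argument("--oap_eta", type=float, default=0.0,
                        help="OAP strength (assumed meaning)")
    parser.add_argument("--oap_beta_min", type=float, default=0.3,
                        help="lower bound used by OAP (assumed meaning)")
    parser.add_argument("--oap_warmup", type=int, default=3,
                        help="number of tasks before full OAP kicks in")
    return parser


# The values used by the generated scripts in this commit.
script_args = build_parser().parse_args(
    ["--cpi_gamma", "0.5", "--oap_eta", "0.5",
     "--oap_beta_min", "0.3", "--oap_warmup", "3"]
)
```

Because the constructor defaults are `cpi_gamma=0.0` and `oap_eta=0.0`, passing the flags explicitly in each script is what would set these two values to 0.5 for these runs.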

improve_gainlora/improve_gainlora.tex ADDED

	@@ -0,0 +1,231 @@

+\documentclass[border=10pt]{standalone}
+\usepackage{tikz}
+\usetikzlibrary{shapes.geometric, arrows.meta, calc}
+\usepackage{pifont}
+\usepackage{amsmath}
+\usepackage{fontawesome5}
+\begin{document}
+\begin{tikzpicture}[
+    >=stealth,
+    font=\sffamily\footnotesize,
+    % === REUSED STYLES FROM GAINLORA ===
+    lm_outer/.style={rectangle, draw=blue!60!black, fill=blue!15, thick, rounded corners=4mm, minimum width=4cm, minimum height=5.5cm, align=center},
+    lm_inner/.style={rectangle, fill=cyan!5, rounded corners=2mm, minimum width=3.4cm, minimum height=4cm},
+    input_box/.style={rectangle, draw=orange!50, fill=orange!20, rounded corners=1mm, text width=5cm, inner sep=4pt, align=left},
+    ans_box/.style={rectangle, draw=orange!50, fill=orange!20, font=\bfseries, inner sep=4pt},
+    q_h_box/.style={rectangle, draw=yellow!80!orange, fill=yellow!30, rounded corners=1.5mm, minimum width=3cm, minimum height=4mm},
+    trap_A_old/.style={trapezium, draw=teal!80!black, fill=green!10, trapezium angle=65, thick, minimum width=1.5cm, minimum height=0.7cm, align=center, shape border rotate=180},
+    trap_B_old/.style={trapezium, draw=teal!80!black, fill=green!10, trapezium angle=65, thick, minimum width=1.5cm, minimum height=0.7cm, align=center},
+    trap_A_new/.style={trapezium, draw=blue!80!black, fill=blue!10, trapezium angle=65, thick, minimum width=1.5cm, minimum height=0.7cm, align=center, shape border rotate=180},
+    trap_B_new/.style={trapezium, draw=blue!80!black, fill=blue!10, trapezium angle=65, thick, minimum width=1.5cm, minimum height=0.7cm, align=center},
+    var_w/.style={rectangle, draw=blue!80!black, fill=cyan!10, minimum size=6mm, inner sep=2pt, align=center},
+    var_a_old/.style={rectangle, draw=teal!80!black, fill=green!15, minimum size=5mm, inner sep=2pt},
+    var_a_new/.style={rectangle, draw=blue!80!black, fill=blue!15, minimum size=5mm, inner sep=2pt},
+    step_box/.style={rectangle, draw=blue!40, fill=blue!10, text width=2.8cm, align=left, inner sep=3pt},
+    plus_op/.style={circle, draw=black, thick, fill=white, inner sep=0pt, minimum size=4.5mm},
+    times_op/.style={circle, draw=black, thick, fill=white, inner sep=0pt, minimum size=4.5mm},
+    % === NEW STYLES FOR SPECROUTE ===
+    spectral_old/.style={rectangle, draw=violet!80!black, thick, fill=violet!8, rounded corners=2mm, minimum width=2cm, minimum height=1.4cm, align=center},
+    spectral_new/.style={rectangle, draw=red!70!black, thick, fill=violet!8, rounded corners=2mm, minimum width=2cm, minimum height=1.4cm, align=center},
+    c5_badge/.style={rectangle, draw=orange!80!black, thick, fill=yellow!40, rounded corners=1.5mm, inner sep=2pt, font=\scriptsize\bfseries},
+    prop_box/.style={rectangle, draw=violet!70!black, fill=violet!8, text width=1.8cm, align=center, minimum height=0.9cm},
+    calib_label/.style={font=\scriptsize, text=orange!70!black, align=center},
+    argmax_badge/.style={rectangle, draw=red!70!black, thick, fill=red!8, rounded corners=2mm, inner sep=4pt, align=center}
+]
+% =======================================================
+% PART (a) - SPECROUTE EXPANDABLE LORA ARCHITECTURE
+% =======================================================
+% 1. LM Box & Inputs (same structure as GainLoRA)
+\node[input_box] (input_a) at (0, 0) {Someone who had a very bad flight might be given a trip in this to make up for it?\\ \textbf{Option:} (A)first class (B)propitious (C)reputable (D)one (E)sufficient};
+\node[font=\large] at (-3, 0) {$\boldsymbol{x}$};
+\node[lm_outer] (lm) at (0, 4.5) {};
+\node[font=\bfseries] at (0, 6.8) {Language Models};
+\node[lm_inner] (lm_in) at (0, 4.2) {};
+\node[ans_box] (ans) at (0, 8.2) {Answer: A};
+\draw[->, thick, line width=1.2pt] (0, 7.25) -- (ans.south);
+\node[q_h_box] (h) at (0, 2.7) {};
+\node at (-1.8, 2.7) {$\boldsymbol{h}$};
+\node[q_h_box] (q) at (0, 5.7) {};
+\node at (-1.8, 5.7) {$\boldsymbol{q}$};
+\node[plus_op] (lm_plus) at (0, 4.2) {$\boldsymbol{+}$};
+\node[rectangle, draw=black, fill=white, align=center, font=\scriptsize, minimum width=1.3cm] (W) at (-1.1, 4.2) {Pre-trained\\Weights\\$\boldsymbol{W}$\\\textcolor{cyan}{\ding{101}}};
+\node[var_w] (wt_small) at (1.2, 4.2) {$w_t$};
+\draw[->, thick, line width=1.2pt] (input_a.north) -- (h.south);
+\draw[->, thick, line width=1.2pt] (h.north) -- (lm_plus.south);
+\draw[->, thick, line width=1.2pt] (lm_plus.north) -- (q.south);
+% 2. Big Expandable Box
+\draw[black, thick, rounded corners=4mm] (3.5, -0.5) rectangle (16.5, 9);
+\draw[dashed, thick] (1.5, 4.5) -- (3.5, 9);
+\draw[dashed, thick] (1.5, 3.9) -- (3.5, -0.5);
+% Legend
+\node[plus_op, minimum size=4mm] at (5.5, 8.5) {$\boldsymbol{+}$};
+\node at (6.5, 8.5) {Addition};
+\node[times_op, minimum size=4mm] at (8, 8.5) {$\boldsymbol{\times}$};
+\node at (9.2, 8.5) {Multiplication};
+\node[text=cyan, font=\large] at (10.8, 8.5) {\ding{101}};
+\node at (11.6, 8.5) {: Frozen};
+\node[c5_badge] at (13, 8.5) {C5};
+\node[font=\footnotesize] at (14.5, 8.5) {: Data-Informed};
+% Output
+\node[var_w, minimum size=7mm] (wt_big) at (9.5, 7.5) {$w_t$};
+\node[plus_op] (big_plus) at (9.5, 6) {$\boldsymbol{+}$};
+\draw[->, thick, line width=1.5pt] (big_plus.north) -- (wt_big.south);
+% Bus line
+\draw[thick, line width=1.5pt] (5.5, 4.8) -- (13.5, 4.8);
+\draw[thick, line width=1.5pt] (9.5, 4.8) -- (big_plus.south);
+% Background Old/New
+\fill[green!15, rounded corners=2mm] (4, 0) rectangle (9.2, 4.5);
+\node[font=\bfseries] at (6.6, -0.2) {Old Branches};
+\fill[blue!15, rounded corners=2mm] (10.5, 0) rectangle (13.5, 4.5);
+\node[font=\bfseries] at (12, -0.2) {New Branch};
+% Branch 1 (X = 5.5) -- CHANGED: alpha_1 instead of a_1
+\node[var_a_old] (a1) at (4.5, 4) {$\alpha_1$};
+\node[times_op] (t1_top) at (5.5, 4) {$\boldsymbol{\times}$};
+\node[trap_A_old] (A1) at (5.5, 2.7) {$\boldsymbol{A_1}$\\ \tiny\textcolor{cyan}{\ding{101}}};
+\node[times_op] (t1_bot) at (5.5, 1.5) {$\boldsymbol{\times}$};
+\node[trap_B_old] (B1) at (5.5, 0.5) {$\boldsymbol{B_1}$\\ \tiny\textcolor{cyan}{\ding{101}}};
+\draw[->, thick, line width=1.5pt] (B1.north) -- (t1_bot.south);
+\draw[->, thick, line width=1.5pt] (t1_bot.north) -- (A1.south);
+\draw[->, thick, line width=1.5pt] (A1.north) -- (t1_top.south);
+\draw[->, thick, line width=1.5pt] (t1_top.north) -- (5.5, 4.8);
+\draw[->, thick, line width=1.2pt] (a1.east) -- (t1_top.west);
+% Dots
+\node[font=\large] at (7.2, 2.7) {$\boldsymbol{\dots}$};
+\node[font=\large] at (7.2, 0.5) {$\boldsymbol{\dots}$};
+% Branch t-1 (X = 8.4) -- CHANGED: alpha_{t-1}
+\node[var_a_old] (atm1) at (7.4, 4) {$\alpha_{t\text{-}1}$};
+\node[times_op] (ttm1_top) at (8.4, 4) {$\boldsymbol{\times}$};
+\node[trap_A_old] (Atm1) at (8.4, 2.7) {$\boldsymbol{A_{t-1}}$\\ \tiny\textcolor{cyan}{\ding{101}}};
+\node[times_op] (ttm1_bot) at (8.4, 1.5) {$\boldsymbol{\times}$};
+\node[trap_B_old] (Btm1) at (8.4, 0.5) {$\boldsymbol{B_{t-1}}$\\ \tiny\textcolor{cyan}{\ding{101}}};
+\draw[->, thick, line width=1.5pt] (Btm1.north) -- (ttm1_bot.south);
+\draw[->, thick, line width=1.5pt] (ttm1_bot.north) -- (Atm1.south);
+\draw[->, thick, line width=1.5pt] (Atm1.north) -- (ttm1_top.south);
+\draw[->, thick, line width=1.5pt] (ttm1_top.north) -- (8.4, 4.8);
+\draw[->, thick, line width=1.2pt] (atm1.east) -- (ttm1_top.west);
+% Branch t (X = 12.5) -- CHANGED: alpha_t, C5 badge on A_t, A_t now frozen
+\node[var_a_new] (at) at (11.5, 4) {$\alpha_t$};
+\node[times_op] (tt_top) at (12.5, 4) {$\boldsymbol{\times}$};
+\node[trap_A_new] (At) at (12.5, 2.7) {$\boldsymbol{A_t}$\\ \tiny\textcolor{cyan}{\ding{101}}};
+\node[c5_badge] at (13.8, 2.7) {C5};
+\node[times_op] (tt_bot) at (12.5, 1.5) {$\boldsymbol{\times}$};
+\node[trap_B_new] (Bt) at (12.5, 0.5) {$\boldsymbol{B_t}$\\ \tiny\textcolor{red}{\faFire}};
+\draw[->, thick, line width=1.5pt] (Bt.north) -- (tt_bot.south);
+\draw[->, thick, line width=1.5pt] (tt_bot.north) -- (At.south);
+\draw[->, thick, line width=1.5pt] (At.north) -- (tt_top.south);
+\draw[->, thick, line width=1.5pt] (tt_top.north) -- (12.5, 4.8);
+\draw[->, thick, line width=1.2pt] (at.east) -- (tt_top.west);
+% Step annotations -- CHANGED for SpecRoute
+\node[step_box] (step1) at (15, 7.2) {\textbf{1.} C5 Data-Informed\\Init for $A_t$};
+\draw[->, thick] (step1.west) -- (wt_big.east);
+\node[step_box] (step2) at (15, 4.6) {\textbf{2.} Hard Top-1\\Spectral Routing};
+\node[step_box] (step3) at (15, 2) {\textbf{3.} Training $B_t$\\for new task};
+\node[font=\large\bfseries] at (8, -1.5) {(a) Expandable LoRA Architecture in SpecRoute};
+% =======================================================
+% PART (b) - SPECTRAL ROUTING (replaces Gating Modules)
+% =======================================================
+\begin{scope}[shift={(21.5, 0)}]
+    % Input
+    \node[input_box] (input_b) at (4.5, 0) {Someone who had a very bad flight might be given a trip in this to make up for it?\\ \textbf{Option:} (A)first class (B)propitious (C)reputable (D)one (E)sufficient};
+    \node[font=\large] at (1.5, 0) {$\boldsymbol{x}$};
+    % Frozen embedding label
+    \node[font=\scriptsize, text=blue!70!black] at (4.5, 1.3) {$h = \text{embed}(x)$ \textcolor{cyan}{\ding{101}}};
+    % Backgrounds for old/new spectral blocks
+    \fill[violet!8, rounded corners=2mm] (0.2, 1.6) rectangle (6, 4.2);
+    \fill[blue!8, rounded corners=2mm] (7.3, 1.6) rectangle (10.7, 4.2);
+    % Spectral Affinity Blocks (replacing gating modules g_i)
+    % OLD expert 1 -- ALL FROZEN, no fire icon
+    \node[spectral_old] (s1) at (1.5, 2.8) {$\boldsymbol{A_1}$\;\textcolor{cyan}{\ding{101}}\\[4pt] \scriptsize $\alpha_1(h)\!=\!\frac{\|A_1 h\|^2}{r\|h\|^2}$};
+    % Dots
+    \node[font=\large] at (3.2, 2.8) {$\boldsymbol{\dots}$};
+    % OLD expert t-1
+    \node[spectral_old] (stm1) at (4.8, 2.8) {$\boldsymbol{A_{t\text{-}1}}$\;\textcolor{cyan}{\ding{101}}\\[4pt] \scriptsize $\alpha_{t\text{-}1}(h)$};
+    % NEW expert t -- frozen + C5, NO fire icon (parameter-free routing)
+    \node[spectral_new] (st) at (9, 2.8) {$\boldsymbol{A_t}$\;\textcolor{cyan}{\ding{101}}\\[4pt] \scriptsize $\alpha_t(h)$};
+    % C5 badge on new expert
+    \node[c5_badge] at (10.4, 3.3) {C5};
+    % Arrows from input to spectral blocks
+    \draw[->, thick, line width=1.2pt] (4.5, 1) -- (4.5, 1.5) -- (4.8, 1.5) -- (4.8, 2.1);
+    \draw[->, thick, line width=1.2pt] (2.5, 1) -- (2.5, 1.5) -- (1.5, 1.5) -- (1.5, 2.1);
+    \draw[->, thick, line width=1.2pt] (6.5, 1) -- (6.5, 1.5) -- (9, 1.5) -- (9, 2.1);
+    % Calibration Normalization layer
+    \node[rectangle, draw=orange!60!black, fill=orange!10, rounded corners=1.5mm,
+          minimum width=9.5cm, minimum height=0.6cm, align=center] (calib_bar) at (5, 4.7) {
+        \footnotesize Calibration: $\alpha_t^{\text{cal}}(h) = \alpha_t(h)\,/\,\hat{\mu}_t$ \quad\textcolor{gray}{(EMA normalization)}
+    };
+    % Arrows from spectral blocks to calibration
+    \draw[->, thick, line width=1.2pt] (1.5, 3.5) -- (1.5, 4.4);
+    \draw[->, thick, line width=1.2pt] (4.8, 3.5) -- (4.8, 4.4);
+    \draw[->, thick, line width=1.2pt] (9, 3.5) -- (9, 4.4);
+    % argmax routing box
+    \node[argmax_badge] (argmax) at (5, 5.7) {
+        \footnotesize Hard Top-1: $\;t^* = \arg\max_t\;\alpha_t^{\text{cal}}(h)$
+    };
+    \draw[->, thick, line width=1.2pt] (calib_bar.north) -- (argmax.south);
+    % Output routing weights
+    \node[var_a_old] (a1_b) at (1.5, 6.8) {$\alpha_1^{\text{cal}}$};
+    \node[var_a_old] (atm1_b) at (4.8, 6.8) {$\alpha_{t\text{-}1}^{\text{cal}}$};
+    \node[var_a_new] (at_b) at (9, 6.8) {$\alpha_t^{\text{cal}}$};
+    \draw[->, thick, line width=1.2pt] (argmax.north) -- ++(0, 0.3) -| (a1_b.south);
+    \draw[->, thick, line width=1.2pt] (argmax.north) -- ++(0, 0.3) -| (atm1_b.south);
+    \draw[->, thick, line width=1.2pt] (argmax.north) -- ++(0, 0.3) -| (at_b.south);
+    % Properties boxes (replacing Initialization/Updating/Imposing constraints)
+    \node[prop_box] (p1) at (1.5, 8) {No learnable\\parameters};
+    \node[prop_box] (p2) at (4.8, 8) {Drift-free\\routing};
+    \node[prop_box] (p3) at (9, 8) {C5 Data-\\Informed Init};
+    % Dashed box around first two properties (like original constraints box)
+    \draw[dashed, violet!80!black, thick, rounded corners=2mm]
+        (0, 7.3) -- (6.2, 7.3) -- (6.2, 7.8) -- (7, 8) -- (6.2, 8.2) -- (6.2, 8.8) -- (0, 8.8) -- cycle;
+    % Arrow from C5 property to spectral block (like imposing constraints feedback)
+    \draw[->, thick, line width=1.2pt] (p3.east) -- ++(1,0) |- (st.east);
+    \node[font=\large\bfseries] at (5, -1.5) {(b) Spectral Routing in SpecRoute};
+\end{scope}
+\end{tikzpicture}
+\end{document}
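
Panel (b) of the figure describes routing as three steps: a spectral affinity `alpha_i(h) = ||A_i h||^2 / (r ||h||^2)` per expert, a calibration `alpha_i / mu_hat_i`, and a hard top-1 `argmax`. The following is a minimal pure-Python sketch of those three formulas exactly as labelled in the figure. It is not the repository's implementation: the function names are hypothetical, the `mu_hat` values are taken as given (the figure only says "EMA normalization" and does not specify the update), and the `eps` guard is an added assumption.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]  # LoRA A matrix as r rows of length d


def spectral_affinity(A: Matrix, h: Sequence[float]) -> float:
    """alpha(h) = ||A h||^2 / (r * ||h||^2) for A of shape [r, d]."""
    r = len(A)
    h_norm_sq = sum(x * x for x in h)
    if r == 0 or h_norm_sq == 0.0:
        return 0.0  # undefined in the formula; treated as zero affinity here
    proj_norm_sq = sum(sum(a * x for a, x in zip(row, h)) ** 2 for row in A)
    return proj_norm_sq / (r * h_norm_sq)


def route_top1(experts: Sequence[Matrix],
               mu_hat: Sequence[float],
               h: Sequence[float],
               eps: float = 1e-12) -> Tuple[int, List[float]]:
    """Return (index of the winning expert, calibrated affinities)."""
    calibrated = [
        spectral_affinity(A, h) / max(mu, eps)  # alpha_cal = alpha / mu_hat
        for A, mu in zip(experts, mu_hat)
    ]
    best = max(range(len(calibrated)), key=calibrated.__getitem__)
    return best, calibrated


# Two rank-1 experts in a 3-dimensional embedding space.
A1 = [[1.0, 0.0, 0.0]]
A2 = [[0.0, 1.0, 0.0]]
h = [3.0, 4.0, 0.0]

uncalibrated_choice, _ = route_top1([A1, A2], [1.0, 1.0], h)
calibrated_choice, _ = route_top1([A1, A2], [0.5, 1.0], h)
```

In this toy case the raw affinities are 9/25 and 16/25, so expert 2 wins when both `mu_hat` values are 1. Halving `mu_hat` for expert 1 lifts its calibrated score to 0.72 and flips the decision, which is the effect the calibration bar in the figure is meant to convey. No trainable parameter appears anywhere, consistent with the "No learnable parameters" property box.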

improve_gainlora/src/cl_trainer_specroute.py CHANGED

@@ -146,7 +146,9 @@ class SpecRoute_Trainer(Seq2SeqTrainer):
                  compute_metrics=None, callbacks=None,
                  lambda_entropy=0.0, use_preconditioning=False,
                  precond_eps=1e-6, entropy_warmup_ratio=0.1,
-                 n_batches_c5=100, previous_lora_path=None):
         self.previous_lora_path = previous_lora_path
         if callbacks is None:
             callbacks = []
@@ -169,6 +171,13 @@ class SpecRoute_Trainer(Seq2SeqTrainer):
         # C5: Data-Informed Subspace Initialization
         self.n_batches_c5 = n_batches_c5
         self._task_covariance = []  # list of {chunk_index: cov_tensor} per layer
     def _save(self, output_dir=None, state_dict=None):
         # T5 shared embeddings are incompatible with safetensors; force pytorch format
@@ -497,12 +506,70 @@ class SpecRoute_Trainer(Seq2SeqTrainer):
                 break
         return reg_matrix, reg_trans_matrix, eval(local_dir.split('-')[0]) - 1
     def get_reg_matrix(self):
         """
         V11: Project current LoRA A into null-space of old tasks' GPM bases.
         Also re-initialize prompt_key for learned routing (ROOT-style SVD).
         """
         self.feature_list, self.feature_trans_list, self._cur_task = self.load_previous_reg_matrix()
         # ================================================================
         # V11: Prompt-key re-initialization (ROOT-style)
@@ -546,61 +613,186 @@ class SpecRoute_Trainer(Seq2SeqTrainer):
                     ).to("cuda:0")
                 self.feature_mat.append(feature_mat)
-                # C5: Data-Informed Subspace Initialization
-                # Replace random A_t with optimal subspace from Constrained PCA.
-                # Eigenvectors of Q@C_t@Q are in null(P_old) by construction.
                 if self._task_covariance and i < len(self._task_covariance):
                     r = module.lora_q.lora_A.data.shape[0]  # LoRA rank
                     for index in self.feature_list[i].keys():
                         C_t   = self._task_covariance[i][index]       # [step, step]
                         P_old = feature_mat[index]                     # [step, step]
-                        Q     = torch.eye(module.step, device=P_old.device) - P_old
                         C_tilde = Q @ C_t.to(P_old.device) @ Q
                         # Enforce symmetry and add tiny jitter for numerical stability
-                        C_tilde = (C_tilde + C_tilde.T) * 0.5
-                        # Guard against NaNs or Infs that might cause linalg.eigh to fail
-                        if torch.isnan(C_tilde).any() or torch.isinf(C_tilde).any():
-                            print(f'[C5] WARNING: Layer {i+1} index {index} contains NaN/Inf in C_tilde. Cleaning.')
-                            C_tilde = torch.nan_to_num(C_tilde, nan=0.0, posinf=0.0, neginf=0.0)
-                        # Add tiny diagonal jitter to prevent ill-conditioned matrix errors (error code 641)
-                        # Scaling jitter to mean magnitude of C_tilde
-                        jitter_scale = 1e-10 * (C_tilde.abs().mean() + 1e-12)
-                        C_tilde += jitter_scale * torch.eye(C_tilde.shape[0], device=C_tilde.device)
                         try:
-                            # eigh returns ascending eigenvalues; take last r (largest)
-                            eigvals, eigvecs = torch.linalg.eigh(C_tilde.float())
                         except torch._C._LinAlgError as e:
-                            print(f'[C5] WARNING: Layer {i+1} index {index} - linalg.eigh failed: {e}. Falling back.')
                             continue
-                        # Fallback: if null-space signal is degenerate, keep Kaiming init
-                        if eigvals[-1].item() < 1e-6:
-                            print(f'[C5] Layer {i+1} index {index}: max_eigval={eigvals[-1].item():.2e} < 1e-6, fallback to Kaiming+InfLoRA')
                             continue
-                        top_eigvecs = eigvecs[:, -r:].flip(dims=[1])  # [step, r]
                         A_init = top_eigvecs.T  # [r, step]
                         dtype  = module.lora_q.lora_A.data.dtype
                         sl     = slice(index * module.step, (index + 1) * module.step)
                         module.lora_q.lora_A.data[:, sl].copy_(A_init.to(dtype))
                         module.lora_v.lora_A.data[:, sl].copy_(A_init.to(dtype))
-                    print(f'[C5] Layer {i+1}: A_t initialized via Constrained PCA.')
-                # InfLoRA null-space projection (near no-op after C5, still applied
-                # for numerical correctness and to enforce InfLoRA invariant exactly)
                 for index in self.feature_list[i].keys():
-                    module.lora_q.lora_A.data[:, index*module.step:(index+1)*module.step].copy_(
-                        module.lora_q.lora_A.data[:, index*module.step:(index+1)*module.step]
-                        - torch.mm(
-                            module.lora_q.lora_A.data[:, index*module.step:(index+1)*module.step],
                             feature_mat[index]
                         )
                     )
-                    module.lora_v.lora_A.data[:, index*module.step:(index+1)*module.step].copy_(
-                        module.lora_v.lora_A.data[:, index*module.step:(index+1)*module.step]
-                        - torch.mm(
-                            module.lora_v.lora_A.data[:, index*module.step:(index+1)*module.step],
                             feature_mat[index]
                         )
                     )
@@ -775,6 +967,20 @@ class SpecRoute_Trainer(Seq2SeqTrainer):
         for i in range(len(self.feature_list)):
             torch.save(self.feature_list[i], os.path.join(self.args.output_dir, 'reg_{}.pt'.format(i)))
         # Save trans_input GPM bases
         if getattr(self.model.encoder, "routing_mode", "") == "learned" and hasattr(self, "feature_trans_list"):
             os.makedirs(os.path.join(self.args.output_dir, 'trans_input'), exist_ok=True)

                  compute_metrics=None, callbacks=None,
                  lambda_entropy=0.0, use_preconditioning=False,
                  precond_eps=1e-6, entropy_warmup_ratio=0.1,
+                 n_batches_c5=100, previous_lora_path=None,
+                 cpi_gamma=0.0,
+                 oap_eta=0.0, oap_beta_min=0.3, oap_warmup=3):
         self.previous_lora_path = previous_lora_path
         if callbacks is None:
             callbacks = []
         # C5: Data-Informed Subspace Initialization
         self.n_batches_c5 = n_batches_c5
         self._task_covariance = []  # list of {chunk_index: cov_tensor} per layer
+        # CPI: Contrastive Projected Initialization
+        self.cpi_gamma = cpi_gamma
+        self._old_covariances = []  # list of per-task covariance lists loaded from disk
+        # OAP: Overlap-Aware Projection
+        self.oap_eta = oap_eta
+        self.oap_beta_min = oap_beta_min
+        self.oap_warmup = oap_warmup  # T_warmup: tasks before full OAP kicks in
     def _save(self, output_dir=None, state_dict=None):
         # T5 shared embeddings are incompatible with safetensors; force pytorch format
                 break
         return reg_matrix, reg_trans_matrix, eval(local_dir.split('-')[0]) - 1
+    def _load_old_covariances(self):
+        """CPI: Load projected covariance cov_{i}.pt from ALL previous tasks.
+        Populates self._old_covariances in place and returns None.
+        Outer list: per task; inner list: per layer; each entry is a {chunk_idx: tensor} dict, or None if the file is missing."""
+        if self.cpi_gamma <= 0 or self.cur_task_id == 0:
+            self._old_covariances = []
+            return
+        log_path = os.path.dirname(self.args.output_dir)
+        local_dir = os.path.basename(self.args.output_dir)
+        self._old_covariances = []
+        if hasattr(self, "previous_lora_path") and self.previous_lora_path:
+            previous_lora_list = self.previous_lora_path.split(',')
+            for task_path in previous_lora_list:
+                task_covs = []
+                i = 0
+                for module in self.model.modules():
+                    if hasattr(module, 'get_feature'):
+                        path = os.path.join(task_path, "cov_{}.pt".format(i))
+                        if os.path.exists(path):
+                            cov = torch.load(path, map_location='cpu')
+                            task_covs.append(cov)
+                        else:
+                            task_covs.append(None)
+                        i += 1
+                self._old_covariances.append(task_covs)
+            print(f"[CPI] Loaded covariances from {len(self._old_covariances)} previous tasks (explicit paths)")
+            return
+        # Discover previous task dirs by index
+        cur_idx = int(local_dir.split('-')[0])
+        all_dirs = sorted(os.listdir(log_path))
+        for all_dir in all_dirs:
+            dir_path = os.path.join(log_path, all_dir)
+            if not os.path.isdir(dir_path):
+                continue
+            try:
+                dir_idx = int(all_dir.split('-')[0])
+            except (ValueError, IndexError):
+                continue
+            if dir_idx < cur_idx:
+                task_covs = []
+                i = 0
+                for module in self.model.modules():
+                    if hasattr(module, 'get_feature'):
+                        path = os.path.join(dir_path, "cov_{}.pt".format(i))
+                        if os.path.exists(path):
+                            cov = torch.load(path, map_location='cpu')
+                            task_covs.append(cov)
+                        else:
+                            task_covs.append(None)
+                        i += 1
+                self._old_covariances.append(task_covs)
+        print(f"[CPI] Loaded covariances from {len(self._old_covariances)} previous tasks (auto-discovered)")
     def get_reg_matrix(self):
         """
         V11: Project current LoRA A into null-space of old tasks' GPM bases.
+        CPI: Use discriminant matrix D_t = C_tilde - gamma * C_bar_old for init.
         Also re-initialize prompt_key for learned routing (ROOT-style SVD).
         """
         self.feature_list, self.feature_trans_list, self._cur_task = self.load_previous_reg_matrix()
+        self._load_old_covariances()
         # ================================================================
         # V11: Prompt-key re-initialization (ROOT-style)
                     ).to("cuda:0")
                 self.feature_mat.append(feature_mat)
+                # CPI+OAP: Contrastive Projected Initialization + Overlap-Aware Projection
+                # D_t = C_tilde - gamma * C_bar_old; A_t = top-r eigvecs of D_t
+                # OAP: Q = I - beta_l * P_old (adaptive relaxation per-layer per-chunk)
+                # gamma=0 → original C5; gamma>0 → contrastive discriminative init
+                # eta=0 → strict InfLoRA; eta>0 → OAP relaxation
+                _oap_betas = {}  # index -> beta_l, used by InfLoRA projection below
+                _diag_layer = {}  # diagnostic data per chunk
                 if self._task_covariance and i < len(self._task_covariance):
                     r = module.lora_q.lora_A.data.shape[0]  # LoRA rank
+                    projected_cov_layer = {}  # store C_tilde per chunk for saving
+                    # Compute weighted C_bar_old for this layer (Weighted CPI)
+                    # rho_{s,t} = tr(C̃_s · C_t) / (tr(C̃_s) * tr(C_t)) — domain proximity weight
+                    C_bar_old_layer = {}
+                    C_bar_weights = {}  # idx -> accumulated weight sum
+                    if self.cpi_gamma > 0 and self._old_covariances:
+                        for task_covs in self._old_covariances:
+                            if i < len(task_covs) and task_covs[i] is not None:
+                                for idx, cov_tensor in task_covs[i].items():
+                                    C_s = cov_tensor.float().cuda()
+                                    # Compute domain-proximity weight rho_{s,t}
+                                    if idx in self._task_covariance[i]:
+                                        C_t_for_w = self._task_covariance[i][idx].to(C_s.device).float()
+                                        tr_s = torch.trace(C_s) + 1e-12
+                                        tr_t = torch.trace(C_t_for_w) + 1e-12
+                                        tr_cross = torch.trace(C_s @ C_t_for_w)
+                                        rho_st = max(0.0, (tr_cross / (tr_s * tr_t)).item())
+                                    else:
+                                        rho_st = 1.0  # fallback: equal weight
+                                    if idx not in C_bar_old_layer:
+                                        C_bar_old_layer[idx] = rho_st * C_s
+                                        C_bar_weights[idx] = rho_st
+                                    else:
+                                        C_bar_old_layer[idx] = C_bar_old_layer[idx] + rho_st * C_s
+                                        C_bar_weights[idx] += rho_st
+                        for idx in C_bar_old_layer:
+                            w = C_bar_weights[idx]
+                            if w > 1e-12:
+                                C_bar_old_layer[idx] /= w
                     for index in self.feature_list[i].keys():
                         C_t   = self._task_covariance[i][index]       # [step, step]
                         P_old = feature_mat[index]                     # [step, step]
+                        # OAP: compute overlap ratio rho_l and adaptive beta_l
+                        if self.oap_eta > 0:
+                            # Warmup: eta_eff = eta * min(1, (t-1)/T_warmup)
+                            t_idx = self.cur_task_id  # 0-indexed
+                            if self.oap_warmup > 0 and t_idx > 0:
+                                warmup_factor = min(1.0, t_idx / self.oap_warmup)
+                            else:
+                                warmup_factor = 1.0
+                            eta_eff = self.oap_eta * warmup_factor
+                            # beta_min higher for early tasks (conservative)
+                            beta_min_eff = self.oap_beta_min if warmup_factor >= 1.0 else max(self.oap_beta_min, 0.7)
+                            C_t_f = C_t.to(P_old.device).float()
+                            P_old_f = P_old.float()
+                            tr_overlap = torch.trace(P_old_f @ C_t_f)
+                            tr_total = torch.trace(C_t_f) + 1e-12
+                            rho_l = (tr_overlap / tr_total).item()
+                            beta_l = max(beta_min_eff, 1.0 - eta_eff * rho_l)
+                            _oap_betas[index] = beta_l
+                        else:
+                            beta_l = 1.0
+                            rho_l = 0.0
+                            _oap_betas[index] = 1.0
+                        # Diagnostic: SSE before OAP
+                        _ct_on_device = C_t.to(P_old.device).float()
+                        _sse_before = (torch.trace(P_old.float() @ _ct_on_device) / (torch.trace(_ct_on_device) + 1e-12)).item()
+                        Q     = torch.eye(module.step, device=P_old.device) - beta_l * P_old
                         C_tilde = Q @ C_t.to(P_old.device) @ Q
+                        # Diagnostic: SSE after OAP = (1-beta_l)^2 * SSE_before (theoretical value;
+                        # assumes P_old is an orthogonal projector and normalises by tr(C_t), not tr(C_tilde))
+                        _sse_after = (1 - beta_l)**2 * _sse_before
+                        # Save projected covariance for future CPI
+                        projected_cov_layer[index] = C_tilde.detach().cpu()
+                        # CPI: subtract gamma times the rho-weighted mean of old-task covariances
+                        if self.cpi_gamma > 0 and index in C_bar_old_layer:
+                            D_t = C_tilde - self.cpi_gamma * C_bar_old_layer[index].to(C_tilde.device)
+                        else:
+                            D_t = C_tilde
                         # Enforce symmetry and add tiny jitter for numerical stability
+                        D_t = (D_t + D_t.T) * 0.5
+                        if torch.isnan(D_t).any() or torch.isinf(D_t).any():
+                            print(f'[CPI] WARNING: Layer {i+1} index {index} contains NaN/Inf in D_t. Cleaning.')
+                            D_t = torch.nan_to_num(D_t, nan=0.0, posinf=0.0, neginf=0.0)
+                        jitter_scale = 1e-10 * (D_t.abs().mean() + 1e-12)
+                        D_t += jitter_scale * torch.eye(D_t.shape[0], device=D_t.device)
                         try:
+                            eigvals, eigvecs = torch.linalg.eigh(D_t.float())
                         except torch._C._LinAlgError as e:
+                            print(f'[CPI] WARNING: Layer {i+1} index {index} - linalg.eigh failed: {e}. Falling back.')
                             continue
+                        # CPI: only use eigenvectors with POSITIVE eigenvalues
+                        # (negative eigenvalues = directions where old tasks dominate)
+                        pos_mask = eigvals > 1e-6
+                        n_pos = int(pos_mask.sum().item())
+                        n_total = eigvals.shape[0]
+                        lambda_min_pos = eigvals[pos_mask].min().item() if n_pos > 0 else 0.0
+                        lambda_max_pos = eigvals[pos_mask].max().item() if n_pos > 0 else 0.0
+                        # Store per-chunk diagnostic
+                        _diag_layer[index] = {
+                            'rho_l': rho_l, 'beta_l': beta_l,
+                            'sse_before': _sse_before, 'sse_after': _sse_after,
+                            'n_pos_eigvals': n_pos, 'n_total_eigvals': n_total,
+                            'lambda_min_pos': lambda_min_pos, 'lambda_max_pos': lambda_max_pos,
+                            'lambda_min_pos_over_r': lambda_min_pos / r,  # Theorem 3 margin
+                        }
+                        if pos_mask.sum() == 0:
+                            print(f'[CPI] Layer {i+1} index {index}: no positive eigenvalues, fallback to Kaiming+InfLoRA')
                             continue
+                        pos_eigvals = eigvals[pos_mask]
+                        pos_eigvecs = eigvecs[:, pos_mask]
+                        # Take top-r from positive eigenvalues (sorted ascending by eigh)
+                        n_take = min(r, pos_eigvals.shape[0])
+                        top_eigvecs = pos_eigvecs[:, -n_take:].flip(dims=[1])  # [step, n_take]
+                        if n_take < r:
+                            # Pad with standard-normal random vectors (not Kaiming-scaled);
+                            # the InfLoRA/OAP projection below attenuates their old-subspace component
+                            pad = torch.randn(top_eigvecs.shape[0], r - n_take, device=top_eigvecs.device)
+                            top_eigvecs = torch.cat([top_eigvecs, pad], dim=1)
                         A_init = top_eigvecs.T  # [r, step]
                         dtype  = module.lora_q.lora_A.data.dtype
                         sl     = slice(index * module.step, (index + 1) * module.step)
                         module.lora_q.lora_A.data[:, sl].copy_(A_init.to(dtype))
                         module.lora_v.lora_A.data[:, sl].copy_(A_init.to(dtype))
+                    cpi_label = "CPI" if self.cpi_gamma > 0 else "C5"
+                    oap_info = ""
+                    if self.oap_eta > 0 and _oap_betas:
+                        avg_beta = sum(_oap_betas.values()) / len(_oap_betas)
+                        oap_info = f", OAP avg_beta={avg_beta:.3f}"
+                    # Diagnostic summary for this layer
+                    if _diag_layer:
+                        avg_rho = sum(d['rho_l'] for d in _diag_layer.values()) / len(_diag_layer)
+                        avg_sse_b = sum(d['sse_before'] for d in _diag_layer.values()) / len(_diag_layer)
+                        avg_sse_a = sum(d['sse_after'] for d in _diag_layer.values()) / len(_diag_layer)
+                        avg_lmin = sum(d['lambda_min_pos_over_r'] for d in _diag_layer.values()) / len(_diag_layer)
+                        avg_npos = sum(d['n_pos_eigvals'] for d in _diag_layer.values()) / len(_diag_layer)
+                        print(f'[{cpi_label}] Layer {i+1}: A_t init (gamma={self.cpi_gamma}{oap_info}) '
+                              f'| rho_l={avg_rho:.3f} SSE={avg_sse_b:.3f}->{avg_sse_a:.3f} '
+                              f'lambda_min+/r={avg_lmin:.4f} n_pos={avg_npos:.1f}/{_diag_layer[list(_diag_layer.keys())[0]]["n_total_eigvals"]}')
+                    else:
+                        print(f'[{cpi_label}] Layer {i+1}: A_t initialized (gamma={self.cpi_gamma}{oap_info}).')
+                    # Store projected covariance for saving later
+                    if not hasattr(self, '_projected_covariances'):
+                        self._projected_covariances = []
+                    self._projected_covariances.append(projected_cov_layer)
+                    # Store diagnostics for saving
+                    if not hasattr(self, '_init_diagnostics'):
+                        self._init_diagnostics = []
+                    self._init_diagnostics.append(_diag_layer)
+                # InfLoRA / OAP projection
+                # OAP: A_t <- A_t(I - beta_l * P_old) instead of A_t <- A_t(I - P_old)
+                # beta_l < 1 allows shared directions to remain (Theorem 4: forgetting
+                # bounded by p_e * (1-beta_l) * M, gated by routing accuracy)
                 for index in self.feature_list[i].keys():
+                    beta_l = _oap_betas.get(index, 1.0)
+                    sl = slice(index * module.step, (index + 1) * module.step)
+                    module.lora_q.lora_A.data[:, sl].copy_(
+                        module.lora_q.lora_A.data[:, sl]
+                        - beta_l * torch.mm(
+                            module.lora_q.lora_A.data[:, sl],
                             feature_mat[index]
                         )
                     )
+                    module.lora_v.lora_A.data[:, sl].copy_(
+                        module.lora_v.lora_A.data[:, sl]
+                        - beta_l * torch.mm(
+                            module.lora_v.lora_A.data[:, sl],
                             feature_mat[index]
                         )
                     )
         for i in range(len(self.feature_list)):
             torch.save(self.feature_list[i], os.path.join(self.args.output_dir, 'reg_{}.pt'.format(i)))
+        # CPI: Save projected covariance for future tasks' contrastive init
+        _proj_covs = getattr(self, '_projected_covariances', self._task_covariance)
+        if _proj_covs:
+            for i in range(len(_proj_covs)):
+                torch.save(_proj_covs[i], os.path.join(self.args.output_dir, 'cov_{}.pt'.format(i)))
+            print(f'[CPI] Saved {len(_proj_covs)} projected covariance matrices.')
+        # Save CPI/OAP diagnostics for post-hoc analysis
+        _diag = getattr(self, '_init_diagnostics', None)
+        if _diag:
+            diag_path = os.path.join(self.args.output_dir, 'init_diagnostics.pt')
+            torch.save(_diag, diag_path)
+            print(f'[DIAG] Saved init diagnostics to {diag_path}')
         # Save trans_input GPM bases
         if getattr(self.model.encoder, "routing_mode", "") == "learned" and hasattr(self, "feature_trans_list"):
             os.makedirs(os.path.join(self.args.output_dir, 'trans_input'), exist_ok=True)
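Reviewer's note on the CPI + OAP initialisation in `get_reg_matrix`: the per-chunk math is easier to check in isolation, so here is a minimal standalone sketch. It uses NumPy rather than torch to stay lightweight, and the helper names `oap_beta` and `cpi_oap_init` are hypothetical (they do not exist in the repository). It assumes `P_old` is a symmetric orthogonal projector onto the old-task subspace, which is the condition under which the `(1 - beta)^2` overlap identity holds.

```python
import numpy as np


def oap_beta(rho, eta, beta_min, task_idx, warmup):
    """Protection strength beta for one chunk (hypothetical helper).

    Mirrors the schedule in the diff: eta is ramped up over the first
    `warmup` tasks (task_idx is 0-indexed), and while the ramp is still
    in progress beta is floored at max(beta_min, 0.7).
    """
    if warmup > 0 and task_idx > 0:
        factor = min(1.0, task_idx / warmup)
    else:
        factor = 1.0
    floor = beta_min if factor >= 1.0 else max(beta_min, 0.7)
    return max(floor, 1.0 - eta * factor * rho)


def cpi_oap_init(C_t, P_old, C_bar_old, r, gamma, beta, tol=1e-6, seed=0):
    """Return (A_init, C_tilde) for one chunk; A_init has shape [r, d].

    A_init is None when D has no positive eigenvalue, in which case the
    caller would keep its default initialisation (as the diff does).
    """
    d = C_t.shape[0]
    Q = np.eye(d) - beta * P_old          # beta = 1 is the strict projection
    C_tilde = Q @ C_t @ Q                 # (relaxed) projected covariance
    D = C_tilde - gamma * C_bar_old       # contrastive discriminant matrix
    D = 0.5 * (D + D.T)                   # enforce symmetry before eigh
    eigvals, eigvecs = np.linalg.eigh(D)  # eigenvalues in ascending order
    keep = eigvals > tol                  # positive directions only
    if not keep.any():
        return None, C_tilde
    top = eigvecs[:, keep][:, ::-1][:, :r]  # largest eigenvalue first
    if top.shape[1] < r:
        # Fewer than r usable directions: pad with Gaussian columns.
        rng = np.random.default_rng(seed)
        pad = rng.standard_normal((d, r - top.shape[1]))
        top = np.concatenate([top, pad], axis=1)
    return top.T, C_tilde


if __name__ == "__main__":
    C_t = np.diag([4.0, 3.0, 2.0, 1.0])
    P_old = np.diag([1.0, 0.0, 0.0, 0.0])  # old tasks occupy axis 0
    rho = np.trace(P_old @ C_t) / np.trace(C_t)
    beta = oap_beta(rho, eta=0.5, beta_min=0.3, task_idx=3, warmup=3)
    A, _ = cpi_oap_init(C_t, P_old, np.zeros((4, 4)), r=2, gamma=0.0, beta=beta)
    print(rho, beta, A.shape)
```

With `gamma=0` and `beta=1` the sketch reduces to the original C5 path (top eigenvectors of the strictly projected covariance). Setting `beta < 1` leaves a `(1 - beta)^2` fraction of the old-subspace energy in `C_tilde`, which is the quantity the diff logs as `sse_after`.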

improve_gainlora/src/run_t5.py CHANGED

@@ -198,6 +198,22 @@ class ModelArguments:
         default=100,
         metadata={"help": "Number of training batches for C5 activation covariance collection."},
     )
     run_single: bool = field(
         default=False,
@@ -961,7 +977,11 @@ def main():
             precond_eps=model_args.precond_eps,
             entropy_warmup_ratio=model_args.entropy_warmup_ratio,
             n_batches_c5=model_args.n_batches_c5,
-            previous_lora_path=model_args.previous_lora_path
         )
         if training_args.do_train:
             if not model_args.run_single:  # C5 is only useful for tasks t>=2
@@ -1133,6 +1153,32 @@ def main():
             with open(os.path.join("logs_and_outputs", training_args.run_name, "outputs", "task_order.txt"), 'w') as f:
                 f.write(data_args.task_order)
     return results

         default=100,
         metadata={"help": "Number of training batches for C5 activation covariance collection."},
     )
+    cpi_gamma: Optional[float] = field(
+        default=0.0,
+        metadata={"help": "CPI contrastive strength. 0=C5 (no contrastive), >0=CPI. Recommended: 0.5."},
+    )
+    oap_eta: Optional[float] = field(
+        default=0.0,
+        metadata={"help": "OAP relaxation strength. 0=strict InfLoRA, >0=adaptive relaxation. Recommended: 0.5."},
+    )
+    oap_beta_min: Optional[float] = field(
+        default=0.3,
+        metadata={"help": "OAP minimum protection. Lower=more sharing, higher=safer. Range (0,1]."},
+    )
+    oap_warmup: Optional[int] = field(
+        default=3,
+        metadata={"help": "OAP warmup: number of tasks before full OAP. Early tasks use conservative beta_min."},
+    )
     run_single: bool = field(
         default=False,
             precond_eps=model_args.precond_eps,
             entropy_warmup_ratio=model_args.entropy_warmup_ratio,
             n_batches_c5=model_args.n_batches_c5,
+            previous_lora_path=model_args.previous_lora_path,
+            cpi_gamma=model_args.cpi_gamma,
+            oap_eta=model_args.oap_eta,
+            oap_beta_min=model_args.oap_beta_min,
+            oap_warmup=model_args.oap_warmup
         )
         if training_args.do_train:
             if not model_args.run_single:  # C5 is only useful for tasks t>=2
             with open(os.path.join("logs_and_outputs", training_args.run_name, "outputs", "task_order.txt"), 'w') as f:
                 f.write(data_args.task_order)
+            # [DIAG] Save routing decision stats for p_e analysis
+            if training_args.model_name == 'specroute' and hasattr(trainer.model.encoder, '_routing_decisions'):
+                routing_decisions = trainer.model.encoder._routing_decisions
+                if routing_decisions:
+                    all_decisions = torch.cat(routing_decisions, dim=0)  # (N,)
+                    n_tasks = len(trainer.model.encoder.spectral_signatures) + 1
+                    # Current task is always index 0 in spectral routing
+                    # (signatures are ordered: [current, old_1, old_2, ...])
+                    routed_to_current = (all_decisions == 0).float().mean().item()
+                    diag_msg = (f'[DIAG-ROUTING] Task {cur_task} (id={cur_task_id}): '
+                                f'routed_to_current={routed_to_current:.3f} '
+                                f'({int((all_decisions == 0).sum())}/{len(all_decisions)}) '
+                                f'n_tasks={n_tasks}')
+                    print(diag_msg)
+                    # Distribution across all tasks
+                    for t in range(n_tasks):
+                        frac = (all_decisions == t).float().mean().item()
+                        print(f'  task_idx={t}: {frac:.3f}')
+                    # Save routing decisions tensor
+                    save_path_diag = training_args.output_dir
+                    if not prompt_config["run_single"]:
+                        save_path_diag = training_args.output_dir + "/saved_weights"
+                    torch.save(all_decisions, os.path.join(save_path_diag, 'routing_decisions.pt'))
+                    # Reset for next eval round
+                    trainer.model.encoder._routing_decisions = []
     return results
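Reviewer's note on the `[DIAG-ROUTING]` block: the statistic it prints is a histogram over hard top-1 routing decisions. The sketch below reproduces that summary in plain Python so it can be checked without a model. The function name `routing_summary` is hypothetical, and reading `1 - routed_to_current` as an estimate of the routing error `p_e` assumes every evaluated sample belongs to the current task, which the diff does not state explicitly.

```python
from collections import Counter


def routing_summary(decisions, n_tasks, current_idx=0):
    """Summarise hard top-1 routing decisions (one task index per sample).

    `current_idx` defaults to 0 because the diff's comment says the current
    task is always index 0 in the signature ordering.
    """
    if not decisions:
        raise ValueError("no routing decisions recorded")
    counts = Counter(decisions)
    total = len(decisions)
    # Fraction of samples routed to each task index, in index order.
    distribution = [counts.get(t, 0) / total for t in range(n_tasks)]
    routed_to_current = distribution[current_idx]
    return {
        "routed_to_current": routed_to_current,
        # Only a valid error estimate if all samples are current-task samples.
        "p_e_estimate": 1.0 - routed_to_current,
        "distribution": distribution,
    }


if __name__ == "__main__":
    summary = routing_summary([0, 0, 1, 0, 2, 0, 0, 1], n_tasks=3)
    for idx, frac in enumerate(summary["distribution"]):
        print(f"task_idx={idx}: {frac:.3f}")
```

Keeping the summary separate from the collection step also makes it straightforward to re-run on a saved `routing_decisions.pt` after converting the tensor to a list.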

improve_gainlora/src/t5_specroute.py CHANGED

@@ -586,6 +586,13 @@ class T5Stack(T5PreTrainedModel):
                 else:
                     self.all_attn_weights.append(key_attention_weights.squeeze(2).mean(dim=0, keepdim=True).detach().to(torch.float).cpu().numpy())
             self.key_attention_weights = key_attention_weights
         else:
             # Decoder or run_single: use whatever was passed (from encoder)

                 else:
                     self.all_attn_weights.append(key_attention_weights.squeeze(2).mean(dim=0, keepdim=True).detach().to(torch.float).cpu().numpy())
+                # [DIAG] Log routing decisions for p_e measurement
+                # For hard Top-1, record which task index was selected per sample
+                if not hasattr(self, '_routing_decisions'):
+                    self._routing_decisions = []
+                routed_task_idx = key_attention_weights.squeeze(2).argmax(dim=1)  # (B,)
+                self._routing_decisions.append(routed_task_idx.detach().cpu())
             self.key_attention_weights = key_attention_weights
         else:
             # Decoder or run_single: use whatever was passed (from encoder)

root_gainlora/root_gainlora.tex ADDED

	@@ -0,0 +1,199 @@

+\documentclass[border=10pt]{standalone}
+\usepackage{tikz}
+\usetikzlibrary{shapes.geometric, arrows.meta, calc}
+\usepackage{pifont}
+\usepackage{amsmath}
+\usepackage{fontawesome5}
+\begin{document}
+\begin{tikzpicture}[
+    >=stealth,
+    font=\sffamily\footnotesize,
+    lm_outer/.style={rectangle, draw=blue!60!black, fill=blue!15, thick, rounded corners=4mm, minimum width=4cm, minimum height=5.5cm, align=center},
+    lm_inner/.style={rectangle, fill=cyan!5, rounded corners=2mm, minimum width=3.4cm, minimum height=4cm},
+    input_box/.style={rectangle, draw=orange!50, fill=orange!20, rounded corners=1mm, text width=5cm, inner sep=4pt, align=left},
+    ans_box/.style={rectangle, draw=orange!50, fill=orange!20, font=\bfseries, inner sep=4pt},
+    q_h_box/.style={rectangle, draw=yellow!80!orange, fill=yellow!30, rounded corners=1.5mm, minimum width=3cm, minimum height=4mm},
+    trap_A_old/.style={trapezium, draw=teal!80!black, fill=green!10, trapezium angle=65, thick, minimum width=1.5cm, minimum height=0.7cm, align=center, shape border rotate=180},
+    trap_B_old/.style={trapezium, draw=teal!80!black, fill=green!10, trapezium angle=65, thick, minimum width=1.5cm, minimum height=0.7cm, align=center},
+    trap_A_new/.style={trapezium, draw=blue!80!black, fill=blue!10, trapezium angle=65, thick, minimum width=1.5cm, minimum height=0.7cm, align=center, shape border rotate=180},
+    trap_B_new/.style={trapezium, draw=blue!80!black, fill=blue!10, trapezium angle=65, thick, minimum width=1.5cm, minimum height=0.7cm, align=center},
+    var_w/.style={rectangle, draw=blue!80!black, fill=cyan!10, minimum size=6mm, inner sep=2pt, align=center},
+    var_a_old/.style={rectangle, draw=teal!80!black, fill=green!15, minimum size=5mm, inner sep=2pt},
+    var_a_new/.style={rectangle, draw=blue!80!black, fill=blue!15, minimum size=5mm, inner sep=2pt},
+    gate_old/.style={rectangle, draw=black, thick, fill=gray!20, rounded corners=2mm, minimum width=1.8cm, minimum height=1.2cm, align=center},
+    gate_new/.style={rectangle, draw=red!80!black, thick, fill=gray!20, rounded corners=2mm, minimum width=1.8cm, minimum height=1.2cm, align=center},
+    const_box/.style={rectangle, draw=blue!80!black, fill=blue!10, text width=1.8cm, align=center, minimum height=0.9cm},
+    step_box/.style={rectangle, draw=blue!40, fill=blue!10, text width=2.8cm, align=left, inner sep=3pt},
+    plus_op/.style={circle, draw=black, thick, fill=white, inner sep=0pt, minimum size=4.5mm},
+    times_op/.style={circle, draw=black, thick, fill=white, inner sep=0pt, minimum size=4.5mm}
+]
+% =======================================================
+% PART (a) - EXPANDABLE LORA ARCHITECTURE
+% =======================================================
+% 1. LM Box & Inputs
+\node[input_box] (input_a) at (0, 0) {Someone who had a very bad flight might be given a trip in this to make up for it?\\ \textbf{Option:} (A)first class (B)propitious (C)reputable (D)one (E)sufficient};
+\node[font=\large] at (-3, 0) {$\boldsymbol{x}$};
+\node[lm_outer] (lm) at (0, 4.5) {};
+\node[font=\bfseries] at (0, 6.8) {Language Models};
+\node[lm_inner] (lm_in) at (0, 4.2) {};
+\node[ans_box] (ans) at (0, 8.2) {Answer: A};
+\draw[->, thick, line width=1.2pt] (0, 7.25) -- (ans.south);
+\node[q_h_box] (h) at (0, 2.7) {};
+\node at (-1.8, 2.7) {$\boldsymbol{h}$};
+\node[q_h_box] (q) at (0, 5.7) {};
+\node at (-1.8, 5.7) {$\boldsymbol{q}$};
+\node[plus_op] (lm_plus) at (0, 4.2) {$\boldsymbol{+}$};
+\node[rectangle, draw=black, fill=white, align=center, font=\scriptsize, minimum width=1.3cm] (W) at (-1.1, 4.2) {Pre-trained\\Weights\\$\boldsymbol{W}$\\\textcolor{cyan}{\ding{101}}};
+\node[var_w] (wt_small) at (1.2, 4.2) {$w_t$};
+\draw[->, thick, line width=1.2pt] (input_a.north) -- (h.south);
+\draw[->, thick, line width=1.2pt] (h.north) -- (lm_plus.south);
+\draw[->, thick, line width=1.2pt] (lm_plus.north) -- (q.south);
+% 2. Big Expandable Box
+\draw[black, thick, rounded corners=4mm] (3.5, -0.5) rectangle (16.5, 9);
+\draw[dashed, thick] (1.5, 4.5) -- (3.5, 9);
+\draw[dashed, thick] (1.5, 3.9) -- (3.5, -0.5);
+% Legend
+\node[plus_op, minimum size=4mm] at (6.5, 8.5) {$\boldsymbol{+}$};
+\node at (7.5, 8.5) {Addition};
+\node[times_op, minimum size=4mm] at (9.2, 8.5) {$\boldsymbol{\times}$};
+\node at (10.4, 8.5) {Multiplication};
+\node[text=cyan, font=\large] at (12, 8.5) {\ding{101}};
+\node at (12.8, 8.5) {: Frozen};
+% Output of the expandable box
+\node[var_w, minimum size=7mm] (wt_big) at (9.5, 7.5) {$w_t$};
+\node[plus_op] (big_plus) at (9.5, 6) {$\boldsymbol{+}$};
+\draw[->, thick, line width=1.5pt] (big_plus.north) -- (wt_big.south);
+% Horizontal bus line
+\draw[thick, line width=1.5pt] (5.5, 4.8) -- (13.5, 4.8);
+\draw[thick, line width=1.5pt] (9.5, 4.8) -- (big_plus.south);
+% Background Old/New
+\fill[green!15, rounded corners=2mm] (4, 0) rectangle (9.2, 4.5);
+\node[font=\bfseries] at (6.6, -0.2) {Old Branches};
+\fill[blue!15, rounded corners=2mm] (10.5, 0) rectangle (13.5, 4.5);
+\node[font=\bfseries] at (12, -0.2) {New Branch};
+% Branch 1 (X = 5.5)
+\node[var_a_old] (a1) at (4.5, 4) {$a_1$};
+\node[times_op] (t1_top) at (5.5, 4) {$\boldsymbol{\times}$};
+\node[trap_A_old] (A1) at (5.5, 2.7) {$\boldsymbol{A_1}$\\ \tiny\textcolor{cyan}{\ding{101}}};
+\node[times_op] (t1_bot) at (5.5, 1.5) {$\boldsymbol{\times}$};
+\node[trap_B_old] (B1) at (5.5, 0.5) {$\boldsymbol{B_1}$\\ \tiny\textcolor{cyan}{\ding{101}}};
+\draw[->, thick, line width=1.5pt] (B1.north) -- (t1_bot.south);
+\draw[->, thick, line width=1.5pt] (t1_bot.north) -- (A1.south);
+\draw[->, thick, line width=1.5pt] (A1.north) -- (t1_top.south);
+\draw[->, thick, line width=1.5pt] (t1_top.north) -- (5.5, 4.8);
+\draw[->, thick, line width=1.2pt] (a1.east) -- (t1_top.west);
+% Ellipsis
+\node[font=\large] at (7.2, 2.7) {$\boldsymbol{\dots}$};
+\node[font=\large] at (7.2, 0.5) {$\boldsymbol{\dots}$};
+% Branch t-1 (X = 8.4)
+\node[var_a_old] (atm1) at (7.4, 4) {$a_{t-1}$};
+\node[times_op] (ttm1_top) at (8.4, 4) {$\boldsymbol{\times}$};
+\node[trap_A_old] (Atm1) at (8.4, 2.7) {$\boldsymbol{A_{t-1}}$\\ \tiny\textcolor{cyan}{\ding{101}}};
+\node[times_op] (ttm1_bot) at (8.4, 1.5) {$\boldsymbol{\times}$};
+\node[trap_B_old] (Btm1) at (8.4, 0.5) {$\boldsymbol{B_{t-1}}$\\ \tiny\textcolor{cyan}{\ding{101}}};
+\draw[->, thick, line width=1.5pt] (Btm1.north) -- (ttm1_bot.south);
+\draw[->, thick, line width=1.5pt] (ttm1_bot.north) -- (Atm1.south);
+\draw[->, thick, line width=1.5pt] (Atm1.north) -- (ttm1_top.south);
+\draw[->, thick, line width=1.5pt] (ttm1_top.north) -- (8.4, 4.8);
+\draw[->, thick, line width=1.2pt] (atm1.east) -- (ttm1_top.west);
+% Branch t (X = 12.5)
+\node[var_a_new] (at) at (11.5, 4) {$a_t$};
+\node[times_op] (tt_top) at (12.5, 4) {$\boldsymbol{\times}$};
+\node[trap_A_new] (At) at (12.5, 2.7) {$\boldsymbol{A_t}$};
+\node[times_op] (tt_bot) at (12.5, 1.5) {$\boldsymbol{\times}$};
+\node[trap_B_new] (Bt) at (12.5, 0.5) {$\boldsymbol{B_t}$\\ \tiny\textcolor{red}{\faFire}};
+\draw[->, thick, line width=1.5pt] (Bt.north) -- (tt_bot.south);
+\draw[->, thick, line width=1.5pt] (tt_bot.north) -- (At.south);
+\draw[->, thick, line width=1.5pt] (At.north) -- (tt_top.south);
+\draw[->, thick, line width=1.5pt] (tt_top.north) -- (12.5, 4.8);
+\draw[->, thick, line width=1.2pt] (at.east) -- (tt_top.west);
+% Step annotation boxes
+\node[step_box] (step2) at (15, 7.2) {\textbf{2.} Integrating new and\\old LoRA branches};
+\draw[->, thick] (step2.west) -- (wt_big.east);
+\node[step_box] (step1) at (15, 4.6) {\textbf{1.} Expanding a new\\LoRA branch};
+% \draw[->, thick] (step1.west) -- (tt_top.east);
+\node[step_box] (step3) at (15, 2) {\textbf{3.} Updating new\\branch to learn new\\task};
+% \draw[->, thick] (step3.west) -- (tt_bot.east);
+\node[font=\large\bfseries] at (8, -1.5) {(a) Expandable LoRA Architecture in GainLoRA};
+% =======================================================
+% PART (b) - GATING MODULES
+% =======================================================
+\begin{scope}[shift={(21.5, 0)}]
+    % Input
+    \node[input_box] (input_b) at (4.5, 0) {Someone who had a very bad flight might be given a trip in this to make up for it?\\ \textbf{Option:} (A)first class (B)propitious (C)reputable (D)one (E)sufficient};
+    \node[font=\large] at (1.5, 0) {$\boldsymbol{x}$};
+    % Gating background
+    \fill[green!15, rounded corners=2mm] (0.2, 1.8) rectangle (5.8, 3.5);
+    \fill[blue!15, rounded corners=2mm] (7.5, 1.8) rectangle (10.5, 3.5);
+    % Gating blocks
+    \node[gate_old] (g1) at (1.5, 2.6) {$\boldsymbol{g_1(\cdot)}$\\ \tiny\textcolor{cyan}{\ding{101}}};
+    \node[gate_old] (gtm1) at (4.5, 2.6) {$\boldsymbol{g_{t-1}(\cdot)}$\\ \tiny\textcolor{cyan}{\ding{101}}};
+    \node[gate_new] (gt) at (9, 2.6) {$\boldsymbol{g_t(\cdot)}$\\ \tiny\textcolor{cyan}{\ding{101}}\quad\textcolor{red}{\faFire}};
+    \node[font=\large] at (3, 2.6) {$\boldsymbol{\dots}$};
+    % Arrows fanning out from x to the gating modules
+    \draw[->, thick, line width=1.2pt] (4.5, 1) -- (4.5, 2);
+    \draw[->, thick, line width=1.2pt] (2.5, 1) -- (2.5, 1.5) -- (1.5, 1.5) -- (1.5, 2);
+    \draw[->, thick, line width=1.2pt] (6.5, 1) -- (6.5, 1.5) -- (9, 1.5) -- (9, 2);
+    % Outputs a_i
+    \node[var_a_old] (a1_b) at (1.5, 4.5) {$a_1$};
+    \node[var_a_old] (atm1_b) at (4.5, 4.5) {$a_{t-1}$};
+    \node[var_a_new] (at_b) at (9, 4.5) {$a_t$};
+    \draw[->, thick, line width=1.2pt] (1.5, 3.2) -- (a1_b.south);
+    \draw[->, thick, line width=1.2pt] (4.5, 3.2) -- (atm1_b.south);
+    \draw[->, thick, line width=1.2pt] (9, 3.2) -- (at_b.south);
+    % Constraint boxes
+    \node[const_box] (c_init) at (1.5, 6.5) {Initialization\\constraints};
+    \node[const_box] (c_update) at (4.5, 6.5) {Updating\\constraints};
+    \node[const_box] (c_impose) at (9, 6.5) {Imposing\\constraints};
+    % Dashed constraints frame with a pointed tip
+    \draw[dashed, blue!80!black, thick, rounded corners=2mm]
+        (0, 5.5) -- (6, 5.5) -- (6, 6.3) -- (6.8, 6.5) -- (6, 6.7) -- (6, 7.5) -- (0, 7.5) -- cycle;
+    % \draw[->, thick, line width=1.2pt] (4.5, 6.05) -- (atm1_b.north);
+    % Feedback loop
+    \draw[->, thick, line width=1.2pt] (c_impose.east) -- ++(1,0) |- (gt.east);
+    \node[font=\large\bfseries] at (5, -1.5) {(b) Gating Modules in GainLoRA};
+\end{scope}
+\end{tikzpicture}
+\end{document}