natmin322 committed on Mar 20

Commit

2dea138

1 Parent(s): a4f8971

v7

Files changed (30)

human_working_IdeaMethod_and_discuss/C2_analysis_and_revision.md +0 -311
human_working_IdeaMethod_and_discuss/comprehensive_methodology.md +0 -663
human_working_IdeaMethod_and_discuss/critical_analysis_report.md +0 -245
human_working_IdeaMethod_and_discuss/discusstion.txt +0 -3
human_working_IdeaMethod_and_discuss/disscuss_1_C2_C1.txt +0 -3
human_working_IdeaMethod_and_discuss/gainlora.txt +0 -3
human_working_IdeaMethod_and_discuss/idea_analysis_from_discussion.md +0 -542
human_working_IdeaMethod_and_discuss/method.md +0 -458
human_working_IdeaMethod_and_discuss/new_idea_analysis.md +0 -470
human_working_IdeaMethod_and_discuss/new_idea_modifier.txt +0 -3
human_working_IdeaMethod_and_discuss/novelty_search_report.md +0 -168
human_working_IdeaMethod_and_discuss/proposal_gainlora_upgrade.md +0 -305
human_working_IdeaMethod_and_discuss/research_rule.txt +0 -3
human_working_IdeaMethod_and_discuss/revised_idea_analysis.md +0 -485
human_working_IdeaMethod_and_discuss/settings.txt +0 -3
human_working_IdeaMethod_and_discuss/simple_idea.txt +0 -3
human_working_IdeaMethod_and_discuss/work_ethic.txt +0 -3
human_working_IdeaMethod_and_discuss/working_method.txt +0 -3
improve_gainlora/v6_discuss.md +142 -0
results/benchmark_explanation.md +0 -219
results/comparison_results.md +0 -257
results/contribution2_implementation_analysis.md +0 -176
results/deep_theoretical_analysis_svd_lora_routing.md +545 -0
results/experiment_versions.md +165 -515
results/review_1.md +0 -154
results/review_2.md +0 -184
results/review_3.md +0 -129
results/review_4.md +0 -130
results/specroute_v2_diagnosis.md +0 -174
results/v5_deep_analysis.md +164 -0

human_working_IdeaMethod_and_discuss/C2_analysis_and_revision.md DELETED

@@ -1,311 +0,0 @@
-# ANALYSIS OF THE C2 CRITIQUE AND REBUILDING THE THEORETICAL FRAMEWORK
-## Following the principle: Analysis → Weaknesses → Motivation → Improvement
-**Date**: Revision after the C2 + C1 critique
----
-# PART 1: ASSESSING THE CRITIQUE OF C2 (Grassmann-OT Routing)
-## 1.1 Summary of the critique
-The critique identifies 4 core problems:
-| # | Problem | Severity |
-|---|--------|--------|
-| 1 | **"Why OT?"**: "global balance" is not needed for routing. OT solves a distribution-matching problem, but routing is a per-input assignment | **Fatal** |
-| 2 | **Inference with batch_size=1 → OT degenerates to argmin**: OT loses its meaning entirely. CL inference is usually per-sample | **Fatal** |
-| 3 | **No theoretical guarantee that OT beats simple max-fit**: nobody has proven OT routing > softmax routing for input assignment | **Serious** |
-| 4 | **"Interesting but not necessary"**: novelty without necessity. This is a flash of insight, not principled reasoning | **Core** |
-## 1.2 Verdict: the critique is CORRECT; C2 (OT) lacks a solid foundation
-### Analysis following the logical chain in research_rule.txt:
-**Step 1: What problem does OT solve?**
-OT (Optimal Transport) finds the optimal coupling between 2 distributions: it moves "mass" from a source distribution to a target distribution at minimum total cost.
-$$\Pi^* = \arg\min_{\Pi \in \mathcal{U}(a,b)} \langle C, \Pi \rangle$$
-where $\mathcal{U}(a,b)$ is the set of couplings satisfying the marginal constraints.
-**Step 2: What problem does routing solve?**
-"Which expert should process this input $x$?" → this is a **per-input assignment**, not distribution matching.
-**Step 3: The core contradiction:**
-| Aspect | OT | Routing in CL |
-|-----------|-----|-------------------|
-| **Unit of operation** | Batch-level (needs a batch to build a distribution) | Per-input (each input needs its own decision) |
-| **Objective** | Minimize the total global transport cost | Maximize routing accuracy for EACH input |
-| **Constraint** | Marginal constraints (balance) | No balance needed: if 90% of the test set is task A, then 90% should route to A |
-| **Batch_size=1** | Degenerate: $\Pi$ has only 1 row → argmin cost = simple assignment | Works normally |
-**Step 4: Why "global balance" is NOT valid for CL:**
-- In MoE training: balance is needed to prevent expert collapse (experts that are never trained die). OT load balancing is reasonable there (BASE Layers, Sinkhorn Routing).
-- In CL inference: ALL experts are already frozen → no collapse risk → balance is a redundant constraint, even a harmful one (it forces routing to the wrong expert just to "balance").
-**Step 5: Conclusion:**
-> **C2 (Grassmann-OT Routing) is rejected.** OT was chosen because it is "novel" (nobody has used OT for CL routing), NOT because it solves a real problem better than the alternatives. This is exactly the "flash of insight" that research_rule.txt warns against.
-### Evidence from the code: the code does NOT implement OT
-Key observation: **The current code (t5_specroute.py) implements projection-based softmax routing, NOT OT.**
-```python
-# From t5_specroute.py::compute_spectral_routing()
-fit_scores = torch.cat(fits, dim=1)  # (B, n_tasks)
-weights = torch.softmax(fit_scores / self.routing_temperature, dim=1)  # softmax, NOT OT
-```
-→ The code already went in the right direction. Only the idea document proposed OT, and it was never implemented. This is a clear sign that, once it met practice, OT turned out to be unnecessary.
----
-# PART 2: ASSESSING THE CRITIQUE OF C1 (Spectral LoRA Signatures)
-## 2.1 Summary of the C1 critique
-The critique says C1 is "already fairly good" but needs:
-> "Why is a spectral signature better than a prompt key? Beyond 'carrying geometric information', it must be shown to make routing MORE ACCURATE at task boundaries, where an input can be close to several tasks."
-## 2.2 Verdict: the critique is CORRECT; C1 needs stronger motivation
-C1 currently explains the *what* (SVD for the signature) but lacks the *why* at a deeper level. What needs to be shown:
-### Why spectral signature > prompt key? 5 mathematical reasons
-**Reason 1: The prompt key is an INDIRECT representation**
-- GainLoRA: $w_t = \sigma(\text{cos}(\text{trans\_input}(x), \text{prompt\_key}_t))$
-- `prompt_key` is a SEPARATELY LEARNED vector with no direct link to the computation the LoRA performs
-- Consequence: the routing decision is based on "what the input RESEMBLES" (similarity space), NOT on "WHICH expert is suited to process it" (functional space)
-**Reason 2: The spectral signature is a DIRECT functional representation**
-- SVD of $\Delta W_t = B_t A_t = U_t \Sigma_t V_t^T$
-- Right singular vectors $V_t$: exactly the directions in input space that expert $t$ **modifies most strongly**
-- Singular values $\sigma_t$: the magnitude of the modification along each direction
-- **Proposition (from InfLoRA)**: Fine-tuning $A_t$ = fine-tuning $W$ within span($B_t$). Hence the SVD of $B_t A_t$ captures EXACTLY the operating region.
-- Routing on the spectral signature = "which expert will produce the largest change for this input?" → directly on target
-**Reason 3: The prompt key NEEDS GPM protection → it still drifts**
-- GainLoRA needs 3 separate GPM sets for routing: trans_input[0], trans_input[2], prompt_key
-- Even with GPM, routing parameters still drift (GPM only protects via subspace projection; it does NOT guarantee zero drift)
-- The spectral signature is computed FROM frozen weights → **immutable by definition** → zero drift
-**Lý do 4: Multi-resolution vs single-resolution**
-- Prompt key: 1 vector $\in \mathbb{R}^d$ per task (global level)
-- Spectral signature: per-layer signatures (48 layers in T5-Large Q+V) → routing is decided at **each layer** based on local geometry
-- Benefit: two tasks can overlap in low-level features but diverge in high-level ones → multi-resolution routing captures this
-**Reason 5, THE STRONGEST POINT: Task boundary behavior**
-Consider an input $h$ lying on the boundary between task A and task B:
-**With a prompt key:**
-$$\text{cos}(\text{trans\_input}(h), \text{prompt\_key}_A) \approx \text{cos}(\text{trans\_input}(h), \text{prompt\_key}_B)$$
-→ The two similarities are nearly equal → routing is ambiguous
-→ The decision depends on the **trans_input mapping** (learned, may drift) → unreliable at the boundary
-**With spectral projection:**
-$$\text{fit}_A(h) = \frac{\sum_i \sigma_{A,i}^2 (v_{A,i}^T h)^2}{\sum_i \sigma_{A,i}^2 \cdot \|h\|^2} \quad \text{vs} \quad \text{fit}_B(h) = \frac{\sum_i \sigma_{B,i}^2 (v_{B,i}^T h)^2}{\sum_i \sigma_{B,i}^2 \cdot \|h\|^2}$$
-→ Measures the **fraction of the input's energy lying in the operating subspace** → reflects "which expert will act more strongly on this input"
-**In the boundary region:**
-- If the subspaces are well separated ($d_G(\mathcal{V}_A, \mathcal{V}_B)$ large): fit_A ≫ fit_B or vice versa → clear routing
-- If the subspaces overlap: both experts can handle the input → soft blending (softmax) gives a weighted combination → BETTER than hard assignment
-- Singular value weighting: favors the expert whose **more important directions** align with the input → more discriminative than uniform projection
-**Formal comparison:**
-| Criterion | Prompt Key (GainLoRA) | Spectral Signature (SpecRoute) |
-|----------|----------------------|-------------------------------|
-| Origin | Learned parameter (indirect) | SVD of LoRA weights (direct functional) |
-| Forgetting risk | Yes (needs GPM protection) | No (immutable from frozen weights) |
-| Resolution | Single global vector | Per-layer per-attention |
-| Task boundary | Depends on learned mapping | Depends on actual subspace overlap |
-| Extra parameters | trans_input (MLP) + prompt_key | None (0 extra params) |
-| Extra GPM cost | 3 sets of GPM projections | None |
-| Interpretability | Black-box similarity | Geometric: "what % of the input energy lies in the expert's subspace" |
----
-# PART 3: REBUILDING THE THEORETICAL FRAMEWORK: KILL OT, RESTRUCTURE C2
-## 3.1 Principle (per research_rule.txt)
-> "An idea must come from: theoretical analysis → identifying weaknesses → motivation → proposed improvement → experiments → consolidation"
-Applying this:
-1. **Analysis**: GainLoRA routing relies on learned gating (trans_input + prompt_key)
-2. **Weaknesses**: 3 specific weaknesses (identified in Section 3.2 below)
-3. **Motivation**: Frozen LoRA weights contain enough geometric information for routing → exploit it
-4. **Improvement**: Spectral projection routing: parameter-free, functionally grounded
-## 3.2 Three specific weaknesses of GainLoRA routing (motivates C1 + C2)
-### Weakness 1: Routing Forgetting — Learned routing parameters drift
-GainLoRA needs GPM constraints for trans_input (2 layers) + prompt_key. However:
-- GPM only projects the gradient onto the null space → **approximate protection**, with no guarantee of zero interference
-- Each new task "eats" additional subspace for the routing GPM → capacity is exhausted faster
-- **Experiment to quantify this**: measure routing accuracy on old tasks BEFORE and AFTER training a new task → expect degradation even with GPM
-### Weakness 2: Indirect Task Representation
-- `prompt_key_t` encodes the "characteristics" of task $t$ → but in WHICH SPACE? In the feature space of the trans_input MLP, NOT in weight space or task-functional space
-- The prompt key learns "which inputs RESEMBLE task t" (similarity view), NOT "which expert SHOULD process the input" (functional view)
-- Consequence: when an input lies on a boundary, similarity-based routing LACKS functional information → suboptimal
-### Weakness 3: Routing Overhead
-- Trans_input: 2-layer MLP (d_model → hidden → d_model) = 2 × d_model × hidden + biases
-- Prompt_key: d_model per task
-- GPM for routing: 3 sets × dim per task × num_tasks
-- Total overhead grows linearly with the number of tasks → scalability concern
-## 3.3 New contribution structure (3C restructured)
-### C1: Spectral LoRA Signatures — Task Characterization via Frozen Weights
-**Motivation chain:**
-1. LoRA branch task $t$: $\Delta W_t = B_t A_t$ (frozen after training)
-2. SVD: $\Delta W_t = U_t \Sigma_t V_t^T$
-3. Right singular vectors $V_t$ = input directions task $t$ operates on (InfLoRA Proposition 1)
-4. Singular values $\Sigma_t$ = importance of each direction
-5. **Signature** $\mathcal{S}_t = (V_t, \Sigma_t)$ per layer = complete characterization of task's operating subspace + importance profile
-6. Zero extra storage cost beyond weights (derived, not stored separately)
-7. Immutable (from frozen weights) → zero drift
-**Contribution**: Formalize spectral task characterization for LoRA-based CL. Show that $(V_t, \Sigma_t)$ contains all the information needed for routing.
-### C2: Projection-Based Parameter-Free Routing — REPLACE OT
-**Motivation chain:**
-1. **Weakness identification**: GainLoRA routing: learned + indirect + overhead (the 3 weaknesses in 3.2)
-2. **Theoretical insight**: C1 gives us $\mathcal{S}_t = (V_t, \Sigma_t)$ per layer, a direct characterization of "which region expert $t$ operates on"
-3. **Natural routing criterion**: the weighted Rayleigh quotient measures the fraction of input energy captured by the expert's subspace
-$$\text{fit}_t(h) = \frac{\sum_{i=1}^{r} \sigma_{t,i}^2 \cdot (v_{t,i}^T h)^2}{\sum_{i=1}^{r} \sigma_{t,i}^2 \cdot \|h\|^2}$$
-4. **Routing weights**:
-$$w_t(h) = \frac{\exp(\text{fit}_t(h) / \tau)}{\sum_{k} \exp(\text{fit}_k(h) / \tau)}$$
-5. **Properties** (achievable WITHOUT OT):
-   - **Parameter-free**: 0 learned routing params → **eliminates routing forgetting entirely** (1st weakness solved)
-   - **Functionally grounded**: Routing based on actual modification energy (2nd weakness solved)
-   - **Zero overhead**: No MLP, no GPM for routing (3rd weakness solved)
-   - **Per-input, constant-time**: $O(r \cdot d)$ per task per input — no iterative Sinkhorn
-   - **Works at batch_size=1**: no degeneracy; fully per-input
-6. **Balance is NOT needed**: In CL, routing accuracy > balance. If the test distribution is skewed toward task A → it is CORRECT to route most inputs to A. OT forcing balance = WRONG routing.
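The fit and softmax steps listed above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the code in `t5_specroute.py`: the function names `projection_fit` and `route`, the toy signatures, and the assumption that each `V` has orthonormal columns of shape `(d, r)` are all hypothetical choices made for the example.

```python
import numpy as np

def projection_fit(V, sigma, h):
    """Weighted projection fit of input h against one expert signature.

    V: (d, r) matrix with orthonormal columns (right singular vectors),
    sigma: (r,) singular values, h: (d,) input vector.
    """
    proj = V.T @ h                               # coordinates of h along each v_i
    num = np.sum(sigma**2 * proj**2)             # weighted projection energy
    den = np.sum(sigma**2) * np.dot(h, h)        # normalizer
    return num / den

def route(signatures, h, tau=1.0):
    """Softmax over projection fits; works for a single input (batch size 1)."""
    fits = np.array([projection_fit(V, s, h) for V, s in signatures])
    logits = fits / tau
    logits = logits - logits.max()               # numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Toy setup (hypothetical): two experts on orthogonal 2-D subspaces of R^4.
eye = np.eye(4)
sig_a = (eye[:, :2], np.array([2.0, 1.0]))
sig_b = (eye[:, 2:], np.array([2.0, 1.0]))
h = np.array([1.0, 0.0, 0.0, 0.0])               # aligned with expert A's top direction
weights = route([sig_a, sig_b], h, tau=0.1)
print(weights)
```

In this toy case nearly all the weight goes to expert A, with no batch statistics and no iterative solver involved.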
-**Direct comparison of OT vs Projection Routing:**
-| Criterion | OT Routing (rejected) | Projection Routing (proposed) |
-|----------|----------------------|---------------------------|
-| Training | Sinkhorn iterations (iterative) | Softmax (one-shot) |
-| Inference batch=1 | Degenerates → argmin | Works normally |
-| Balance | Forced (harmful for CL) | Natural (follows the data distribution) |
-| Learned params | Cost matrix can be learned | Zero |
-| Theory | "OT is principled" (for distribution matching, NOT for per-input routing) | Weighted Rayleigh Quotient (exactly right for measuring subspace projection) |
-| Complexity | $O(n^2 \cdot K \cdot \text{iters})$ per batch | $O(r \cdot d \cdot K)$ per input |
-### C3: Elastic Subspace Allocation (ESA), unchanged
-**Not affected by the C2 critique; the design is kept as is.**
----
-# PART 4: NEW THEORETICAL FRAMEWORK: SpecRoute v2
-## 4.1 New narrative (1 paragraph)
-In LoRA-based continual learning, the routing mechanism plays a decisive role: it determines which expert processes each input at inference when the task ID is unavailable (task-agnostic setting). We identify **3 core weaknesses** of current routing (GainLoRA): (1) routing relies on learned parameters (trans_input MLP, prompt_key) → it suffers forgetting even with GPM protection; (2) the prompt key encodes task identity in **similarity space** (what does the input resemble?) rather than **functional space** (which expert should process it?), causing suboptimal assignment at task boundaries; (3) routing overhead grows linearly with the number of tasks (extra MLP + GPM costs). From the observation that the frozen LoRA weights $\Delta W_t = B_t A_t$ contain **complete geometric information** about each expert's operating subspace via SVD, we propose **SpecRoute**, a fully parameter-free routing framework based on spectral signatures and projection-based assignment.
-## 4.2 Motivation chain (formal)
-```
-[Analysis]     GainLoRA routing: cos(trans_input(x), prompt_key) → sigmoid
-                                  ↓
-[Weakness 1]  Learned routing params (trans_input, prompt_key) → forgetting risk
-[Weakness 2]  prompt_key = similarity space ≠ functional space → weak at boundaries
-[Weakness 3]  Extra MLP + 3 GPM sets → overhead scales with tasks
-                                  ↓
-[Insight]      Frozen ΔW = BA → SVD → (V, Σ) = complete operating subspace characterization
-               Projection fit = weighted Rayleigh quotient = exactly measures "what % of
-               input energy lies in expert's operating subspace"
-                                  ↓
-[Proposal]     C1: Spectral Signatures (characterization)
-               C2: Projection-Based Routing (parameter-free, functionally grounded)
-               C3: Elastic Subspace Allocation (fair capacity management)
-                                  ↓
-[Consequences] ✓ Zero routing params → zero routing forgetting
-               ✓ Functionally grounded → better boundary routing
-               ✓ Zero routing overhead → better scalability
-               ✓ Simpler framework (remove trans_input, prompt_key, routing GPM, memory replay)
-```
-## 4.3 Comparing the motivation chains: OT (old) vs Projection Routing (new)
-### OT chain (old), BROKEN:
-```
-"OT is principled" → Why is principled routing needed? → "Global balance"
-→ Why is balance needed? → "Experts should be used evenly"
-→ Why? → ??? (In CL, balance is NOT needed and is even harmful)
-→ BROKEN: Motivation chain terminates without valid root cause
-```
-### Projection Routing chain (new), SOLID:
-```
-"GainLoRA routing forgets + uses wrong space + adds overhead"
-→ Root cause: routing relies on LEARNED PARAMETERS SEPARATE FROM experts
-→ Solution: derive routing FROM expert weights (spectral signatures)
-→ Mechanism: weighted projection (Rayleigh quotient) — standard linear algebra tool
-→ Properties: parameter-free, functionally grounded, zero overhead
-→ SOLID: Motivation chain traces from concrete weakness to principled solution
-```
-## 4.4 Why is softmax enough? No more complex mechanism is needed
-**Argument**: Projection fits are already the "right metric" for routing → softmax merely normalizes them into a probability distribution → NO more complex mechanism is needed (OT, learned gating, etc.)
-**Analogy**: If we have a thermometer that measures temperature accurately, we do NOT need a neural network to decide "hot or cold"; a threshold/softmax suffices. Likewise, the projection fit is ALREADY an accurate measurement of "which expert is suitable" → softmax is enough.
-**Occam's Razor**: Simple mechanism + correct metric > Complex mechanism + proxy metric
-## 4.5 Potential objections and responses
-**Q1: "Projection routing is too simple; it is not enough of a contribution"**
-A1: The contribution lies not in complexity but in:
-- (a) The insight that spectral signatures from frozen weights are SUFFICIENT for routing (C1)
-- (b) Showing that parameter-free routing COMPLETELY ELIMINATES routing forgetting; this is a theoretical guarantee, not an empirical observation
-- (c) Elimination methodology: remove trans_input (MLP) + prompt_key + 3 GPM sets + memory replay → simpler AND better
-**Q2: "Softmax routing is already known; where is the novelty?"**
-A2: The novelty lies in the **routing signal**, not the routing function:
-- Standard MoE: softmax over learned logits → softmax of WHAT matters
-- SpecRoute: softmax over weighted projection fits derived from spectral signatures → the FIT computation is novel, softmax is just normalization
-**Q3: "Why is weighted projection better than unweighted?"**
-A3: The singular value weighting $\sigma_i^2$ favors the directions the expert uses MOST STRONGLY. If expert A uses direction $v_1$ strongly ($\sigma_1 = 5$) and direction $v_2$ weakly ($\sigma_2 = 0.1$), then an input aligned with $v_1$ should be routed to A more strongly than an input aligned with $v_2$. Unweighted projection does not capture this difference.
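The A3 example can be checked numerically. The sketch below is a hypothetical illustration, assuming a rank-2 expert with $\sigma_1 = 5$, $\sigma_2 = 0.1$ and an input that lies entirely inside span($v_1, v_2$), so that $\|h\|^2$ equals the sum of its squared coordinates. The helper names are invented for the example.

```python
# Singular values from the Q3 example: v1 is used strongly, v2 weakly.
SIGMA = (5.0, 0.1)

def weighted_fit(coords):
    """Weighted projection fit; coords[i] = v_i^T h, with h inside span(v1, v2)."""
    h_norm_sq = sum(c * c for c in coords)
    num = sum(s * s * c * c for s, c in zip(SIGMA, coords))
    return num / (sum(s * s for s in SIGMA) * h_norm_sq)

def unweighted_fit(coords):
    """Fraction of ||h||^2 inside the subspace, ignoring singular values."""
    h_norm_sq = sum(c * c for c in coords)
    return sum(c * c for c in coords) / h_norm_sq

along_v1 = (1.0, 0.0)   # input aligned with the strong direction
along_v2 = (0.0, 1.0)   # input aligned with the weak direction

print(weighted_fit(along_v1), weighted_fit(along_v2))
print(unweighted_fit(along_v1), unweighted_fit(along_v2))
```

Under these assumptions the weighted fit is roughly 0.9996 for the input along $v_1$ and roughly 0.0004 for the input along $v_2$, while the unweighted fit is 1 for both and so cannot tell them apart.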
----
-# PART 5: SUMMARY OF CHANGES RELATIVE TO THE OLD IDEA
-| Component | Old idea | New idea | Reason for change |
-|-----------|---------|---------|----------------|
-| **C1** | Spectral LoRA Signatures | Spectral LoRA Signatures **(strengthened task-boundary motivation)** | The critique asked for a clearer demonstration of why it beats the prompt key |
-| **C2** | ~~Grassmann-OT Routing~~ | **Projection-Based Parameter-Free Routing** | OT lacks motivation, degenerates at batch=1, and balance is not needed for CL |
-| **C3** | Elastic Subspace Allocation | Elastic Subspace Allocation **(unchanged)** | Not affected by the critique |
-| **Code** | ~~Need to implement OT~~ | **Code is already correct** (projection routing already implemented) | The code was ahead of the idea document |
-## Key changes in narrative:
-1. **Kill "Grassmann-OT"**: replace it with "Projection-Based Routing"
-2. **New name for C2**: "Subspace Projection Routing" or "Parameter-Free Spectral Routing"
-3. **Motivation chain**: weakness-driven (3 concrete weaknesses of GainLoRA) instead of novelty-driven ("nobody has used OT yet")
-4. **Strengthen C1**: add the task boundary analysis (Section 2.2, Reason 5)
-5. **Grassmann geometry is kept**: used for ANALYSIS (measuring subspace distance, principal angles), NOT for the routing mechanism
-## No code changes needed:
-- `t5_specroute.py`: `compute_spectral_routing()` already implements projection-based softmax routing ✅
-- `cl_trainer_specroute.py`: contains no OT code ✅
-- `run_t5.py`: unaffected ✅
-## Idea document changes needed:
-- Remove all references to OT, Sinkhorn, transport plan
-- C2 = "Projection-Based Routing" with weighted Rayleigh quotient
-- Rewrite the Motivation section following the weakness → insight → solution chain

human_working_IdeaMethod_and_discuss/comprehensive_methodology.md DELETED

@@ -1,663 +0,0 @@
-# CRITICAL ANALYSIS AND SYSTEMATIZATION OF THE IDEAS FROM DISCUSSTION.TXT
-## From raw arguments → Verification → Critique → Proposed methodology
-**Date**: March 9, 2026
-**Method**: Extract the researcher's ideas from the second half of discusstion.txt → separate them from AI flattery → verify each idea with math + literature → critique → systematize into a methodology
-**Principle**: This document does *not* re-explain SpecRoute or GainLoRA. It focuses on **your original ideas**: analyzing what is right, what is wrong, what the AI overstated, and building a methodology from the solid parts.
----
-## I. PROBLEM DEFINITION: Not "Improve Routing" but "What Is The Right Problem?"
-### 1.1 Formal setting
-Given:
-- Pre-trained backbone $W_0 \in \mathbb{R}^{d_{out} \times d_{in}}$ (frozen)
-- A sequence of $T$ tasks arriving sequentially: $\mathcal{T}_1, \mathcal{T}_2, ..., \mathcal{T}_T$
-- Each task $\mathcal{T}_t$ has a dataset $\mathcal{D}_t = \{(x_i^{(t)}, y_i^{(t)})\}$ that is available only while training task $t$
-Constraints:
-- **Zero-replay**: When training task $t$, $\mathcal{D}_{t'}$ is unavailable for every $t' < t$
-- **No task-ID at inference**: At test time, the task that $x$ belongs to is unknown
-- **Expandable LoRA**: Each task $t$ adds a LoRA branch $\Delta W_t = B_t A_t$ (rank $r$), frozen once training finishes
-After $T$ tasks, the forward pass for input $h$ is:
-$$\text{output}(h) = W_0 h + \sum_{t=1}^{T} w_t(h) \cdot B_t A_t h$$
-where $w_t(h) \in [0,1]$ is the routing weight.
-### 1.2 Three inseparable sub-problems
-Any method in this setting must solve 3 problems **simultaneously**:
-| Sub-problem | Input | Output | Constraint |
-|-------------|---------|--------|------------|
-| **R: Routing** | Input $h$, expert set $\{\Delta W_t\}$ | Weights $w(h) \in \mathbb{R}^T$ | No task-ID, computable from $h$ alone |
-| **P: Protection** | New task gradient $\nabla_{\theta} \mathcal{L}_T$ | Projected gradient $\tilde{\nabla}$ | Old experts' functionality preserved |
-| **A: Allocation** | Available subspace $M^{\perp}$, new task demand | How much of $M^{\perp}$ to use | Fair capacity across tasks |
-**Why are they inseparable?**
-Routing quality depends on expert isolation (P), because if a new task interferes with an old expert → the routing signal is corrupted. Expert isolation depends on the subspace budget (A), because a tight orthogonal constraint → good isolation but limited capacity. Limited capacity affects expert quality → which in turn affects routing relevance.
-The cycle: **R ← P ← A ← R**.
-### 1.3 Why this is NOT the MoE problem
-Mixture of Experts (in LLMs) and expandable LoRA CL look alike (many experts, routing needed) but differ fundamentally:
-| Aspect | MoE (LLM) | Expandable LoRA CL |
-|--------|-----------|---------------------|
-| Expert creation | Simultaneous (jointly trained) | Sequential (each expert only sees its task) |
-| Routing | Learned gating, optimized end-to-end | Cannot learn across tasks (forgetting risk) |
-| Load balancing | Desirable (use all experts equally) | NOT desirable (want SELECTIVE activation) |
-| Expert overlap | Expected, managed by auxiliary losses | Constrained by orthogonal projection |
-| Data at routing time | All data available | Zero-replay → only current data |
-Consequence: **No MoE routing technique (OT balancing, learned gating, regularization) is directly applicable.** The CL setting needs its own routing mechanism.
----
-## II. INFORMATION LANDSCAPE: What Is Valid, and What Is a Violation?
-### 2.1 Taxonomy of available information
-After training task $t$ and before discarding $\mathcal{D}_t$, we can extract and store:
-| Type of information | Example | Valid? | Reason |
-|---------------|-------|---------|-------|
-| **Model parameters** | Frozen $A_t, B_t$, GPM bases $U_t$ | ✅ | An artifact of the training process, not data |
-| **Derived quantities from parameters** | SVD of $\Delta W_t = U_t \Sigma_t V_t^T$ | ✅ | Computed from model params alone |
-| **Data statistics** | Mean features $\mu_t$, covariance $\Sigma_t$ | ❌ | Summary of $\mathcal{D}_t$ → violates zero-replay |
-| **Distribution parameters** | vMF $(\mu_t, \kappa_t)$ | ❌ | Fitted on $\mathcal{D}_t$ → violates zero-replay |
-| **Auxiliary learned params** | Prompt keys, trans_input MLPs | ⚠️ Valid but carries forgetting risk | Must be trained → gradient updates can corrupt old tasks |
-### 2.2 A subtle distinction: GPM bases vs data statistics
-GPM computation:
-1. Forward-pass the data through the LoRA → collect the input covariance matrix $C_t \in \mathbb{R}^{d \times d}$
-2. SVD: $C_t = U_t S_t V_t^T$ → take the principal directions $U_t[:, :k]$
-3. Store $U_t[:, :k]$ (directions), DISCARD $S_t$ (magnitudes)
-Why is this valid? Because GPM bases encode the **directions** along which the LoRA input operates; this is a property of the model + data combination that requires a forward pass to extract. However, only the **subspace** (span of directions) is kept, not the **distribution** (how data distributes within the subspace).
-**The red line**: If a method stores a mean feature vector $\mu_t = \frac{1}{N}\sum_i f(x_i^{(t)})$ → that is a data statistic and violates zero-replay. The Feature Distributions paper (ICML 2025) does exactly this, so a clear positioning against it is needed.
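The three GPM steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not GainLoRA's or SpecRoute's actual GPM code: the function name `gpm_basis` is hypothetical, the covariance is uncentered, and `k` is fixed by hand, whereas a real implementation would typically choose it from an energy threshold.

```python
import numpy as np

def gpm_basis(inputs, k):
    """Return a (d, k) orthonormal basis of the dominant input directions.

    inputs: (N, d) activations fed to the LoRA layer for the current task.
    Only directions are returned; the magnitudes S are discarded.
    """
    cov = inputs.T @ inputs / inputs.shape[0]   # (d, d) uncentered input covariance
    U, S, _ = np.linalg.svd(cov)                # singular values in descending order
    return U[:, :k]

# Toy data (hypothetical): variance concentrated in the first 2 of 6 coordinates.
rng = np.random.default_rng(0)
scales = np.array([5.0, 3.0, 0.1, 0.1, 0.1, 0.1])
X = rng.standard_normal((200, 6)) * scales
basis = gpm_basis(X, k=2)
print(basis.shape)
```

What gets stored is a subspace, not a mean or a fitted distribution, which is the distinction this section draws.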
-### 2.3 Consequences for routing design
-From Section 2.1, the routing mechanism may only use:
-1. **Frozen model parameters**: $\{A_t, B_t\}_{t=1}^{T}$, frozen backbone $W_0$
-2. **Quantities derived from frozen parameters**: SVD, norms, angles, etc.
-3. **Current input** $h$ at inference time
-Routing **MUST NOT** use:
-1. Learned parameters (prompt keys, gating networks) → forgetting risk
-2. Data statistics from old tasks (means, distributions) → zero-replay violation
-3. Task labels → no task-ID
-**Proposition 1**: *In zero-replay expandable LoRA CL, a parameter-free routing mechanism (derived entirely from frozen expert weights + current input) satisfies all constraints.*
-*Note*: This does not mean learned routing is "wrong": GainLoRA uses learned params + GPM protection for the routing params → valid, but it requires extra mechanisms (GPM for trans_input, per-step projection). Parameter-free routing removes the need for these auxiliary mechanisms.
----
-## III. EXPERT CHARACTERIZATION: From Frozen Weights to Task Identity
-### 3.1 Fundamental question: "What does this expert DO?"
-Each frozen expert $\Delta W_t = B_t A_t \in \mathbb{R}^{d_{out} \times d_{in}}$ performs:
-$$h \mapsto \Delta W_t h = B_t (A_t h)$$
-From the SVD $\Delta W_t = U_t \Sigma_t V_t^T$, this decomposes into:
-- $V_t^T h$: **Project input** lên principal input directions (WHAT the expert "looks at")
-- $\Sigma_t$: **Scale** each projected component (HOW MUCH the expert cares)
-- $U_t$: **Map to output** space (WHERE the expert "writes")
-### 3.2 Spectral Signature: formal definition
-**Definition**: The *spectral signature* of expert $t$ is the set of pairs:
-$$\mathcal{S}_t = \{(v_{t,i}, \sigma_{t,i})\}_{i=1}^{r}$$
-where $v_{t,i}$ is the $i$-th right singular vector (input direction) and $\sigma_{t,i}$ is the corresponding singular value.
-**Why right singular vectors (V) rather than left (U)?**
-Routing decides from the **input** $h$ → it must compare $h$ with the **input directions** the expert listens to. The right singular vectors $V_t$ are exactly the expert's "input space receptors". The left singular vectors $U_t$ encode the output space: relevant for aggregation, not for routing.
-### 3.3 Projection Fit: measuring "how relevant is expert $t$ to input $h$?"
-**Definition**: The *Weighted Projection Fit* of expert $t$ for input $h$:
-$$\text{fit}_t(h) = \frac{\sum_{i=1}^{r} \sigma_{t,i}^2 (v_{t,i}^T h)^2}{\sum_{i=1}^{r} \sigma_{t,i}^2 \cdot \|h\|^2}$$
-**Explanation of each component**:
-- $(v_{t,i}^T h)^2$: how much of the "energy" of $h$ lies along direction $v_{t,i}$
-- $\sigma_{t,i}^2$: how much the expert values direction $v_{t,i}$ (larger singular values = stronger modification)
-- $\|h\|^2$: normalization
-- Numerator: total weighted projection energy
-- Denominator: an upper bound on the numerator (since $(v_{t,i}^T h)^2 \le \|h\|^2$ for every $i$)
-**Mathematical properties**:
-- $\text{fit}_t(h) \in [0, 1]$
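-- The maximum over $h$ is $\max_i \sigma_{t,i}^2 / \sum_i \sigma_{t,i}^2$, attained when $h$ is aligned with the dominant singular vector; $\text{fit}_t(h) = 1$ is reached only when all the spectral weight sits on that single direction (e.g. $r = 1$)
-- $\text{fit}_t(h) = 0 \iff h \perp \text{span}(V_t)$ (the expert does not "see" the input at all)
-**Connection to the Rayleigh Quotient**:
-If we define $M_t = V_t \text{diag}(\sigma_t^2) V_t^T$ (a PSD matrix), then: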
-$$\text{fit}_t(h) = \frac{h^T M_t h}{\text{tr}(M_t) \cdot h^T h}$$
-This is exactly a **normalized Rayleigh quotient**, a standard tool in spectral theory, NOT an ad hoc construction.
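The identity between the sum form and the Rayleigh-quotient form can be verified numerically. The sketch below is illustrative only: the dimensions, singular values, and random orthonormal $V$ (built with a QR factorization) are arbitrary assumptions, not values from the project.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 3
V, _ = np.linalg.qr(rng.standard_normal((d, r)))   # (d, r) with orthonormal columns
sigma = np.array([3.0, 1.5, 0.5])                  # hypothetical singular values
h = rng.standard_normal(d)

# Sum form of the weighted projection fit.
fit_sum = np.sum(sigma**2 * (V.T @ h) ** 2) / (np.sum(sigma**2) * (h @ h))

# Normalized Rayleigh quotient with M = V diag(sigma^2) V^T.
M = V @ np.diag(sigma**2) @ V.T
fit_rayleigh = (h @ M @ h) / (np.trace(M) * (h @ h))

# Largest attainable fit: input aligned with the dominant singular vector.
v_top = V[:, 0]
fit_top = (v_top @ M @ v_top) / (np.trace(M) * (v_top @ v_top))
upper = sigma.max() ** 2 / np.sum(sigma**2)

print(fit_sum, fit_rayleigh, fit_top, upper)
```

The two forms agree because $\text{tr}(M_t) = \sum_i \sigma_{t,i}^2$ when the columns of $V_t$ are orthonormal. The check also shows that with several non-zero singular values the fit tops out at $\sigma_{\max}^2 / \sum_i \sigma_{t,i}^2$, which is below 1.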
-### 3.4 Why is projection fit "right" for this problem? (And why it can be "wrong")
-**Why it is right (theoretical argument)**:
-1. **Respect expert structure**: fit_t(h) is derived directly from the SVD of the expert weights, which encodes what the expert WAS TRAINED to do.
-2. **Per-input**: Each $h$ yields a different projection fit → supports mixed-task batches (crucial for real inference).
-3. **Parameter-free**: Computed from frozen quantities + current input → no forgetting risk.
-4. **Discriminative by construction**: If experts operate on orthogonal input subspaces (guaranteed approximately by GPM), then:
-   $$\text{span}(V_t) \perp \text{span}(V_{t'}) \ \text{(approximately)}, \quad t \neq t'$$
-   $$\Rightarrow \text{fit}_t(h) \text{ high} \implies \text{fit}_{t'}(h) \text{ low for } t' \neq t$$
-**Why it can be wrong (honest caveats)**:
-1. **Modification energy ≠ modification quality**: $\sigma_{t,i}^2 (v_{t,i}^T h)^2$ measures how **strongly** the expert will modify input $h$ along direction $v_{t,i}$. But a strong modification is NOT necessarily a CORRECT one. The expert can modify strongly yet in the wrong output direction.
-   *Counter-argument*: The expert was trained on task $t$ → its modification patterns encode task-relevant transformations. A high projection fit → the input resembles the training distribution → the modification is likely correct. But this is an **assumption**, not a guarantee.
-2. **GPM orthogonality is approximate**: In practice, null-space projection is imperfect. Small subspace overlaps remain → the discriminative property is weakened.
-3. **Mean pooling loses structure**: Both GainLoRA and SpecRoute use `avg_inputs_embeds = mean(token_embeddings)` for routing. Two sequences with different content but similar averages → misrouted.
----
-## IV. ROUTING MECHANISM — Derive from Principles
-### 4.1 Formulation: routing as maximum likelihood expert assignment
-Given input $h$, find routing weights $w(h) = [w_1(h), ..., w_T(h)]$ such that the weighted combination approximates the oracle:
-$$\sum_{t=1}^{T} w_t(h) \cdot \Delta W_t h \approx \Delta W_{\text{oracle}(h)} h$$
-where $\text{oracle}(h)$ is the "correct" expert (trained on the task that $h$ belongs to).
-### 4.2 Competitive routing (softmax) vs. Independent gating (sigmoid)
-**Independent gating (GainLoRA)**:
-$$w_t(h) = |2\sigma(4 \cdot \text{cos}(k_t, f_t(h))) - 1| \quad \in [0, 1] \text{ independently}$$
-*Advantage*: Allows multiple experts to fire simultaneously (useful if a new task overlaps old concepts).
-*Disadvantages*:
-- $\sum_t w_t(h) \neq 1$ → the modification magnitude varies with the number of experts → scale instability
-- All experts can fire simultaneously → blurring
-- Allows $\sum_t w_t = 0$ → no modification at all → information loss
-**Competitive routing (softmax)**:
-$$w_t(h) = \frac{\exp(\text{fit}_t(h) / \tau)}{\sum_{t'} \exp(\text{fit}_{t'}(h) / \tau)}$$
-*Advantages*:
-- $\sum_t w_t(h) = 1$ → constant modification energy → stable training
-- Forces **competition** → natural selection of most relevant expert(s)
-- $\tau \to 0$: hard routing (winner-take-all); $\tau \to \infty$: uniform averaging
-*Disadvantages*:
-- The ENTIRE weight must be assigned: if an input does not clearly belong to any task, the router still has to "choose"
-- Soft assignment: every expert still contributes, however little, which causes small interference
-**In the CL setting**: Competitive routing is the better fit because:
-1. Tasks non-overlapping → mỗi input thuộc đúng 1 task → competition là đúng inductive bias
-2. Scale stability quan trọng hơn flexibility (15 tasks × 48 layers × 2 projections = many routing decisions)
-3. GPM already ensures expert isolation → independent gating phải học isolation from scratch (redundant)
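The contrast between the two gating rules in Section 4.2 can be checked numerically. The sketch below is a minimal illustration only: the scores are made-up numbers, and the helper names `sigmoid_gate` and `softmax_gate` are hypothetical, not taken from the GainLoRA or SpecRoute code.

```python
import math

def sigmoid_gate(cos_sim):
    """Independent gate |2*sigmoid(4*cos) - 1|, computed separately per expert."""
    return [abs(2.0 / (1.0 + math.exp(-4.0 * c)) - 1.0) for c in cos_sim]

def softmax_gate(fits, tau=1.0):
    """Competitive gate: softmax over fit scores with temperature tau."""
    m = max(fits)  # subtract the max for numerical stability
    exps = [math.exp((f - m) / tau) for f in fits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three experts on a single input.
cos_scores = [0.9, 0.1, -0.8]
fit_scores = [0.30, 0.05, 0.02]

w_sig = sigmoid_gate(cos_scores)
w_soft = softmax_gate(fit_scores, tau=0.05)

print("sigmoid weights:", w_sig, "sum =", sum(w_sig))
print("softmax weights:", w_soft, "sum =", sum(w_soft))
```

With these inputs the sigmoid weights do not sum to one, while the softmax weights do and concentrate on the expert with the highest fit. Note also that the absolute value in the sigmoid gate gives a large weight to a strongly negative cosine.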
-### 4.3 Complete routing algorithm
-```
-INPUT: h ∈ R^{d_model}  (averaged input embedding)
-       {S_t}_{t=1}^{T}  (spectral signatures: {V_t, σ_t} for each layer, each projection)
-       τ > 0             (temperature)
-FOR EACH ENCODER LAYER l, PROJECTION TYPE p ∈ {Q, V}:
-  FOR EACH TASK t = 1, ..., T:
-    V_t^{(l,p)}, σ_t^{(l,p)} = S_t[l, p]     # frozen spectral signature
-    proj = V_t^{(l,p)} h                       # project input onto expert's input space
-    fit_t^{(l,p)} = Σ_i σ²_{t,i} proj²_i / (Σ_i σ²_{t,i} · ||h||²)
-  END FOR
-  # Average fit across layers (global routing decision)
-  fit_t = mean over (l, p) of fit_t^{(l,p)}
-  # Competitive routing
-  w(h) = softmax([fit_1, ..., fit_T] / τ)
-RETURN w(h) ∈ R^T, Σ_t w_t = 1
-```
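The routing pseudocode can be sketched in NumPy as follows. This is an illustrative version under stated assumptions: each signature is stored as a `(V, sigma)` pair with the right singular vectors as rows of `V`, and the function names `projection_fit` and `spectral_routing` as well as the toy data are hypothetical rather than the project's actual code.

```python
import numpy as np

def projection_fit(V, sigma, h):
    """Weighted Rayleigh-quotient fit for one (layer, projection) signature.

    V:     (r, d) array whose rows are right singular vectors
    sigma: (r,) singular values
    h:     (d,) averaged input embedding
    """
    proj = V @ h                                   # coordinates of h along each v_i
    energy = np.sum(sigma**2 * proj**2)
    return energy / (np.sum(sigma**2) * np.dot(h, h))

def spectral_routing(signatures, h, tau=0.1):
    """signatures[t] is a list of (V, sigma) pairs, one per (layer, projection)."""
    fits = np.array([
        np.mean([projection_fit(V, s, h) for V, s in sigs])
        for sigs in signatures
    ])
    logits = fits / tau
    logits -= logits.max()                         # numerical stability
    w = np.exp(logits)
    return w / w.sum(), fits

# Toy check: two "tasks" occupying orthogonal subspaces of R^6.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))  # orthonormal columns
sig_a = [(Q[:, :2].T, np.array([2.0, 1.0]))]
sig_b = [(Q[:, 2:4].T, np.array([2.0, 1.0]))]
h = Q[:, 0] + 0.1 * Q[:, 3]                        # mostly inside task a's subspace
weights, fits = spectral_routing([sig_a, sig_b], h)
print(weights, fits)
```

In this toy setup the input lies almost entirely in the first task's subspace, so nearly all routing weight goes to the first expert. Each fit is bounded between 0 and 1 when the rows of `V` are orthonormal.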
-**Implementation note**: In the current code, fit scores are averaged over encoder layers only (this is consistent: the routing decision comes from the encoder and is applied to the decoder as well). The decoder does not take part in the routing computation.
-### 4.4 Special case: the current task (being trained)
-While task $T$ is being trained, the LoRA weights $(A_T, B_T)$ are not yet frozen, so no spectral signature exists.
-**Current solution**: Use the rows of $A_T$ directly (instead of an SVD). When $r$ is small (= 4), $\Delta W = BA$ has rank at most $r$, and the normalized rows of $A$ approximate the right singular vectors.
-**Explanation**: For $\Delta W = BA$, if $B^T B = I$ (orthonormal columns), then $\Delta W^T \Delta W = A^T A$, so the right singular vectors of $\Delta W$ are those of $A$ and span its row space; they coincide with the rows of $A$ (up to scaling) only when those rows are mutually orthogonal. In practice $B^T B \neq I$, so this is an approximation. The error is assumed to be small at $r=4$, but this has not been measured.
-**Consequence**: Fit for the current task, assuming the rows of $A_T$ are normalized to unit length:
-$$\text{fit}_T(h) = \frac{\|A_T h\|^2}{r \cdot \|h\|^2}$$
-(unweighted, since no singular values are available yet; all directions are treated equally)
----
-## V. ANTI-FORGETTING — Gradient Projection as Structural Isolation
-### 5.1 The problem: protecting old experts while training a new one
-When training task $T$, the gradient $\nabla_{A_T} \mathcal{L}_T$ can inadvertently interfere with old experts through the **shared representation space** (same backbone $W_0$, same input space $\mathbb{R}^{d_{in}}$).
-How interference arises:
-1. Input $h$ from old task $t$ passes through new expert $T$ (routing error)
-2. New expert $T$ trains on a subspace that overlaps with old expert $t$, which modifies shared directions
-### 5.2 GPM (Gradient Projection Memory): the main mechanism
-**Idea**: Ensure the new LoRA operates in the **null-space** of the old LoRA input subspaces.
-**Formalization**: Let $\mathcal{M}_t = \text{span}(U_t^{GPM})$ be the input subspace that expert $t$ uses. Accumulated protected subspace:
-$$\mathcal{M}_{1:T-1} = \text{span}\left(\bigcup_{t=1}^{T-1} U_t^{GPM}\right)$$
-*(incremental: can be computed with a progressive SVD update)*
-New LoRA $A$ initialization:
-$$A_T = A_T^{init} - \text{Proj}_{\mathcal{M}_{1:T-1}}(A_T^{init})$$
-where, for $X \in \mathbb{R}^{r \times d_{in}}$ whose rows live in the input space, $\text{Proj}_{\mathcal{M}}(X) = X U_{\mathcal{M}} U_{\mathcal{M}}^T$ with $U_{\mathcal{M}}$ an orthonormal basis of $\mathcal{M}$ (project the rows onto the old subspace, then subtract).
-**Guarantee**: Every row of $A_T$ is orthogonal to $\mathcal{M}_{1:T-1}$, so $A_T u = 0$ for all $u \in \mathcal{M}_{1:T-1}$: the new LoRA branch does not respond to input components that lie in the protected subspace. This holds exactly at initialization; whether it persists during training depends on how $A_T$ is updated.
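The null-space initialization can be sketched in NumPy. This is an illustration under assumptions: `A` is stored as an `r × d_in` matrix so the projection acts on its rows, the protected basis `U` has orthonormal columns, and the helper name `project_out` and all sizes are hypothetical.

```python
import numpy as np

def project_out(A_init, U):
    """Remove from each row of A_init its component in span(U).

    A_init: (r, d) LoRA down-projection, rows live in the input space
    U:      (d, k) orthonormal basis of the protected subspace
    """
    return A_init - (A_init @ U) @ U.T

rng = np.random.default_rng(0)
d, k, r = 16, 5, 4
U, _ = np.linalg.qr(rng.standard_normal((d, k)))   # hypothetical protected basis
A_T = project_out(rng.standard_normal((r, d)), U)

u = U @ rng.standard_normal(k)                      # a vector inside the protected subspace
print(np.abs(A_T @ u).max())                        # numerically zero
```

After the projection, any input that lies entirely in the protected subspace produces a (numerically) zero activation in the new branch, while the branch keeps its full rank for directions outside that subspace.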
-### 5.3 Per-step projection (needed when routing has learned parameters)
-GainLoRA has the learned parameters `trans_input` (an MLP) and `prompt_key`, so every optimizer step must project the gradient update:
-```python
-# After optimizer.step():
-new_weight = current_weight - project_onto_old_subspace(current_weight - old_weight)
-```
-SpecRoute removes the learned routing parameters, so per-step projection is **NOT NEEDED** for routing. GPM is required only for the LoRA layers.
-**Practical consequence**: The SpecRoute training loop is significantly simpler (no custom `_inner_training_loop`, no per-step weight manipulation; the base-class trainer is used).
-### 5.4 Interaction giữa GPM và routing
-**Key insight**: GPM + spectral routing create **dual protection**:
-1. **GPM** (structural): New expert operates in orthogonal subspace → CAN'T interfere with old expert outputs
-2. **Spectral routing** (functional): Old-task inputs routed to old experts → WON'T be processed by new expert
-Individually, each mechanism is leaky:
-- GPM alone: orthogonality approximate, small interference possible
-- Routing alone: misrouting → wrong expert processes input
-Together: even if routing makes small mistake, GPM ensures interference is orthogonal (small). Even if GPM leaks slightly, routing directs input to correct expert.
-**This means**: We need neither perfect routing NOR perfect orthogonality; it is enough for both to be "good enough" to compensate for each other.
----
-## VI. SUBSPACE ALLOCATION — The Honest Hard Problem
-### 6.1 The capacity problem
-The input space is $\mathbb{R}^{d_{in}}$ ($d = 1024$ for T5-Large). Each task claims a subspace of dimension ≤ $k_t$ for GPM. The available null-space is:
-$$\dim(\mathcal{M}_{1:T}^{\perp}) = d - \dim(\mathcal{M}_{1:T}) \geq d - \sum_{t=1}^{T} k_t$$
-With $T = 15$ tasks, if each task claims $k = 60$ dims: $1024 - 900 = 124$ dims remaining → **tight but feasible**.
-### 6.2 Threshold controls capacity
-GPM threshold $\epsilon$ controls $k_t$: higher threshold → more directions retained → larger $k_t$ → faster exhaustion.
-| Strategy | Formula | Effect |
-|----------|---------|--------|
-| **GainLoRA original** | $\epsilon_t = (1-\epsilon_0) \cdot t/T + \epsilon_0$ | Increasing → early tasks protect more, late tasks protect less. **Unfair**: early tasks occupy the subspace disproportionately. |
-| **Constant threshold** (SpecRoute) | $\epsilon_t = \epsilon_0, \forall t$ | Every task protects the same fraction. **Fair**, but depletion is still linear. |
-| **Importance-weighted** (NOT YET IMPLEMENTED) | $k_t$ allocated based on task complexity | Potentially optimal, but needs a metric for "importance" |
-### 6.3 Frankly: the current ESA (Elastic Subspace Allocation) is weak
-What SpecRoute calls "ESA" is in practice just **a change of threshold schedule from increasing to constant**. That is a hyperparameter change, not an algorithmic contribution.
-**For ESA to be a real contribution**, at least one of the following is needed:
-1. **Importance-weighted protection**: A large singular value $\sigma_i$ marks a direction that matters to the expert and deserves stronger protection. A small singular value marks a less important direction that can be released for future tasks.
-   $$k_t = \min\{k : \sum_{i=1}^{k} \sigma_i^2 / \sum_j \sigma_j^2 \geq \epsilon\}$$
-   SpecRoute currently does NOT use singular values in the GPM decision; it uses only the SVD of the input covariance (a different quantity).
-2. **Subspace recycling**: Detect directions in $\mathcal{M}_{1:T-1}$ that no expert actively uses (routing weight always ~0) and release them.
-3. **Adaptive threshold based on remaining capacity**: $\epsilon_t = f(d - \dim(\mathcal{M}_{1:t-1}))$. The threshold drops as the subspace runs out, which forces later tasks to be more selective.
-**Status**: None of the three is implemented. Any of them, implemented and backed by an ablation study, would count as a real contribution.
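The energy-threshold rule for $k_t$ and the capacity bound from Section 6.1 can be illustrated with a few lines of plain Python. The spectrum below is made up, and the helper names `rank_for_threshold` and `remaining_dims` are hypothetical; this is a sketch of the arithmetic, not the GPM implementation.

```python
def rank_for_threshold(singular_values, eps):
    """Smallest k whose top-k squared singular values hold a fraction >= eps of the energy."""
    energy = [s * s for s in sorted(singular_values, reverse=True)]
    total = sum(energy)
    running = 0.0
    for k, e in enumerate(energy, start=1):
        running += e
        if running / total >= eps:
            return k
    return len(energy)

def remaining_dims(d, claimed):
    """Lower bound on the free null-space once each task has claimed its dimensions."""
    return max(d - sum(claimed), 0)

spectrum = [10.0, 5.0, 1.0, 0.5, 0.1]       # hypothetical singular values
k_loose = rank_for_threshold(spectrum, 0.90)
k_tight = rank_for_threshold(spectrum, 0.995)
free = remaining_dims(1024, [60] * 15)       # the 15-task example from Section 6.1
print(k_loose, k_tight, free)
```

Raising the threshold from 0.90 to 0.995 makes the same spectrum claim one more direction, which is the mechanism by which a higher $\epsilon$ exhausts the null-space faster.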
----
-## VII. REPRESENTATION DRIFT — The Elephant in the Room
-### 7.1 The problem
-The spectral signatures $\{V_t, \sigma_t\}$ are frozen, so they do NOT drift. The input embedding $h$, however, **DOES drift**.
-**Drift mechanism**: In an encoder/decoder architecture, the output of layer $l$ is:
-$$h^{(l+1)} = f\left(W_0^{(l)} h^{(l)} + \sum_t w_t(h^{(0)}) \cdot B_t^{(l)} A_t^{(l)} h^{(l)}\right)$$
-When a new LoRA branch (task $T+1$) is added, $w_t$ changes (a new competitor enters the softmax) → $h^{(l+1)}$ changes → the change cascades through the layers.
-**Consequence**: The projection fit $\text{fit}_t(h)$ at task $T+1$ differs from its value at task $T$ even though $V_t, \sigma_t$ are unchanged, because $h$ has changed.
-### 7.2 Comparison: how does GainLoRA handle drift?
-GainLoRA uses **previous_trans_input**, a frozen MLP snapshot per task. Each old task $t$ has its own:
-$$f_t(x) = \text{SiLU}(W^{out}_t \cdot \text{SiLU}(W^{in}_t \cdot x))$$
-Routing: compute $f_t(\bar{h})$, then take the cosine similarity with the frozen prompt_key $k_t$.
-**Idea**: Each expert "sees" the input through its own "lens" (the frozen MLP) and expects the cosine-similarity patterns from when it was trained. The input may drift, but the prompt_key + trans_input snapshot is a "matched pair", which is expected to give some robustness.
-**Still leaky**: $\bar{h}$ (the average input embedding) still drifts → the output of $f_t(\bar{h})$ differs → the cosine similarity changes. A frozen MLP plus a frozen key does NOT fully compensate for input drift; it only reduces sensitivity.
-### 7.3 SpecRoute: explicitly acknowledge drift, don't pretend to solve it
-SpecRoute claims "zero routing forgetting"; a more accurate statement is:
-> **"Zero parameter drift in routing mechanism"**: the routing computation has no learned parameters, so there is no parameter forgetting. **Representation drift** (changes in input embeddings caused by accumulated LoRA effects) still exists.
-**Why representation drift may be manageable (hypothesis, not yet proven)**:
-1. **Small LoRA rank** ($r = 4$): Each task modifies only a rank-4 subspace. Total modification after 15 tasks: rank ≤ 60 (if orthogonal). In a 1024-dim space this is ~6% of the dimensions → small drift in $h$.
-2. **GPM ensures orthogonal modification**: A new task modifies directions that old tasks do NOT use → the old tasks' projection spaces are little affected.
-3. **Backbone frozen**: $W_0$ does not change → the bulk of the transformation is stable. LoRA only adds a residual.
-**Needs empirical verification**:
-- Measure $\|\text{fit}_t(h) \text{ at task } T - \text{fit}_t(h) \text{ at task } t\|$ across tasks
-- If drift small → hypothesis confirmed
-- If drift large → need explicit drift compensation mechanism
-### 7.4 Potential mitigation (not implemented, but well-defined)
-If representation drift turns out to be severe, the options are:
-1. **Snapshot input normalization**: Store $\mu_t^{proj}, \sigma_t^{proj}$ (mean/std of projected features at training time) and normalize at inference, $\hat{h} = (h - \mu_t^{proj})/\sigma_t^{proj}$, before computing the fit.
-   - **Problem**: $\mu_t^{proj}$ is a data statistic → may violate zero-replay
-   - **Counter**: only the mean/std of the LoRA LAYER output is needed (model output, not data), which is ambiguous territory
-2. **Relative fit**: Instead of the absolute fit $\text{fit}_t(h)$, use the relative ranking. If distribution shift affects all fits similarly, the ranking is preserved.
-   - Softmax inherently does this in part (it is invariant to a common additive shift of all fits, though not to a rescaling)
-3. **Self-calibration**: Periodically (every $k$ tasks), recompute spectral signatures on new LoRA weights.
-   - But old LoRA weights are frozen → their signatures do not change → only the current task is affected → not helpful
----
-## VIII. THE COMPLETE ALGORITHM — End to End
-### 8.1 Training phase (for task $T$)
-```
-INPUTS: Pre-trained backbone W₀
-        Frozen experts {(A_t, B_t)}_{t=1}^{T-1}
-        Spectral signatures {S_t}_{t=1}^{T-1}
-        GPM bases {M_{1:T-1}}
-        Training data D_T
-STEP 1 — Initialize new LoRA branch:
-  A_T^{init} ← random (Kaiming)
-  A_T ← A_T^{init} - Proj_{M_{1:T-1}}(A_T^{init})   # null-space projection
-  B_T ← 0 OR random (scaled small)
-STEP 2 — Train with routing:
-  for each batch (x, y) in D_T:
-    h̄ ← mean_pool(encoder_embed(x))                   # average input embedding
-    w(h̄) ← spectral_routing(h̄, {S_t}_{t<T}, A_T)     # Section IV.3
-    for each layer l:
-      LoRA_output_l ← Σ_t w_t(h̄) · B_t^(l) A_t^(l) h^(l)  # weighted aggregation
-      h^(l+1) ← layer_l(h^(l)) + LoRA_output_l
-    loss ← task_loss(output, y)
-    loss.backward()
-    # Only A_T and B_T have gradients (others frozen)
-    optimizer.step()                                     # No per-step projection needed
-STEP 3 — End of task:
-  Freeze A_T, B_T
-  # Compute spectral signature
-  ΔW_T = B_T @ A_T
-  U, Σ, V^T = SVD(ΔW_T)
-  S_T = {V[:r], Σ[:r]}                                 # store for future routing
-  # Update GPM
-  Compute input covariance from forward passes
-  SVD → extract top-k directions
-  M_{1:T} = M_{1:T-1} ∪ new_directions
-  Save: {A_T, B_T, S_T, M_{1:T}}
-  Discard: D_T (zero-replay)
-```
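The signature extraction in STEP 3 can be sketched in NumPy without forming the full `d_out × d_in` product. This is an illustrative version that assumes `B` has shape `(d_out, r)` and `A` has shape `(r, d_in)`; the function name `spectral_signature` and the random matrices are hypothetical stand-ins for trained weights.

```python
import numpy as np

def spectral_signature(B, A):
    """Return the top-r right singular vectors and singular values of B @ A.

    B: (d_out, r), A: (r, d_in). A QR step lets the SVD run on an r x d_in
    matrix instead of the full d_out x d_in product.
    """
    r = A.shape[0]
    Q, R = np.linalg.qr(B)                 # B = Q R, Q: (d_out, r), R: (r, r)
    _, s, Vt = np.linalg.svd(R @ A, full_matrices=False)
    return Vt[:r], s[:r]                   # rows of Vt are right singular vectors

rng = np.random.default_rng(0)
B = rng.standard_normal((32, 4))
A = rng.standard_normal((4, 24))
V, sigma = spectral_signature(B, A)

# Cross-check against the direct SVD of the full product.
s_full = np.linalg.svd(B @ A, compute_uv=False)[:4]
print(np.allclose(sigma, s_full))
```

Because `Q` has orthonormal columns, `B @ A` and `R @ A` share singular values and right singular vectors, so the shortcut yields the same signature as the direct SVD for small `r`.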
-### 8.2 Inference phase
-```
-INPUT: Test sample x (no task-ID)
-STEP 1 — Encode + route:
-  h̄ ← mean_pool(encoder_embed(x))
-  w(h̄) ← softmax([fit_1(h̄), ..., fit_T(h̄)] / τ)
-STEP 2 — Forward with routing:
-  for each layer l:
-    LoRA_output_l ← Σ_t w_t(h̄) · B_t^(l) A_t^(l) h^(l)
-    h^(l+1) ← layer_l(h^(l)) + LoRA_output_l
-STEP 3 — Decode output
-```
-### 8.3 Complexity analysis
-| Operation | GainLoRA | SpecRoute | Comment |
-|-----------|----------|-----------|---------|
-| Routing computation | $O(T \cdot d \cdot h_{mlp} + T \cdot d)$ | $O(T \cdot r \cdot d \cdot L)$ | SpecRoute: matrix-vector per layer per task |
-| Trainable routing params | $O(2 \cdot d \cdot h_{mlp} + d)$ per task | $0$ | SpecRoute: no routing params |
-| GPM targets | LoRA + trans_input + prompt_key | LoRA only | SpecRoute: simpler GPM |
-| Per-step overhead | Null-space projection for routing params | None | SpecRoute: standard training loop |
-| End-of-task | GPM + freeze + save snapshots | GPM + freeze + SVD | SVD is $O(d_{out} \cdot d_{in} \cdot r)$ — cheap for small $r$ |
-| Memory per task | $A_t, B_t$ + prompt_key + trans_input weights | $A_t, B_t$ + spectral sig $(V_t, \sigma_t)$ | Similar; spectral sig slightly smaller than trans_input |
----
-## IX. POSITIONING IN THE LANDSCAPE
-### 9.1 Method-agnostic comparison
-| Criterion | GainLoRA | InfLoRA | MINGLE | Feature Dist. | TreeLoRA | SpecRoute |
-|-----------|----------|---------|--------|---------------|----------|-----------|
-| Routing type | Learned (MLP+key) | None (equal weight) | Learned (MoE gate) | Feature similarity | Gradient similarity | Spectral projection |
-| Routing forgetting risk | ⚠️ Managed by GPM | N/A | ⚠️ Managed by EMA | ❌ Stores data stats | ⚠️ Needs old gradients | ✅ Parameter-free |
-| Zero-replay | ✅ | ✅ | ✅ | ⚠️ Stores mean features | ⚠️ Needs gradient similarity | ✅ |
-| Anti-forgetting | GPM on LoRA + routing | Null-space init | OGP (orthogonal) | None explicit | None explicit | GPM on LoRA only |
-| Subspace allocation | Increasing threshold | Fixed threshold | EMA relaxation | N/A | N/A | Constant threshold |
-| Aggregation | Weighted sum (sigmoid) | Equal sum | Top-k MoE | Weighted sum | Tree selection | Weighted sum (softmax) |
-### 9.2 Novelty assessment (honest)
-**Clearly novel**:
-- Using SVD of frozen LoRA weights (not data features, not learned keys) as routing signal — no prior work does exactly this.
-- Elimination of ALL learned routing parameters in expandable LoRA CL — GainLoRA, MINGLE both require learned routing.
-**Partially novel**:
-- Weighted Rayleigh quotient for routing — Rayleigh quotient is textbook, but application to LoRA-CL routing is new.
-- Demonstrating that parameter-free routing + GPM = sufficient (if it works empirically) — conceptual contribution.
-**NOT novel**:
-- GPM/null-space projection — from InfLoRA, GainLoRA
-- Expandable LoRA architecture — from O-LoRA, InfLoRA, GainLoRA
-- Softmax routing in MoE-like structures — foundational MoE work
-- SVD as analysis tool for LoRA — SD-LoRA analyzes magnitude/direction
-**Closest competitor**: Feature Distributions (ICML 2025) — stores characterization per expert, uses similarity for routing. Key difference: they store data-level features (mean activation vectors), we store weight-level signatures (SVD of frozen params). They arguably violate or stretch zero-replay; we don't.
----
-## X. WHAT NEEDS TO BE TRUE — Assumptions Checklist
-Every assumption below MUST hold for the methodology to work. Each one needs empirical validation.
-### 10.1 Core assumptions
-| # | Assumption | Status | How to test |
-|---|-----------|--------|-------------|
-| A1 | Projection fit correlates with "correct expert" assignment | ❓ UNTESTED | Compute fit accuracy on task-boundary evaluation sets |
-| A2 | GPM+routing dual protection sufficient to prevent forgetting | ❓ UNTESTED | Compare forgetting metric with vs without routing |
-| A3 | Representation drift is small enough to not corrupt routing | ❓ UNTESTED | Track fit_t(h) variance across tasks for fixed test inputs |
-| A4 | mean_pool captures enough task-relevant signal for routing | ❓ UNTESTED | Compare with max_pool, CLS token, attention-weighted pool |
-| A5 | Softmax temperature τ is not overly sensitive | ❓ UNTESTED | τ ablation study |
-| A6 | rank r=4 is sufficient for spectral signatures to be discriminative | ❓ UNTESTED | r ablation |
-### 10.2 Implied assumptions (from GainLoRA that we inherit)
-| # | Assumption | Status |
-|---|-----------|--------|
-| A7 | T5-Large backbone generalizable to other architectures (LLaMA) | Partially tested (GainLoRA has LLaMA configs) |
-| A8 | 15 tasks is within GPM capacity for d=1024 | Expected (d=1024 >> 15*r*2) |
-| A9 | Q and V projections sufficient (not K) | From GainLoRA design, standard in LoRA literature |
----
-## XI. EXPERIMENTAL VALIDATION PLAN
-### 11.1 What the experiments MUST show (not "nice to have")
-1. **SpecRoute vs. GainLoRA on identical setting**: Same data, same preprocessing, same evaluation protocol. Show routing improves OR at least matches.
-2. **Routing accuracy analysis**: On held-out validation sets of old tasks, what fraction of inputs are correctly routed (highest weight to correct expert)?
-3. **Forgetting curve**: Plot per-task performance after each subsequent task. Compare degradation.
-4. **Representation drift measurement**: For fixed test inputs from task $t$, track $\text{fit}_t(h)$ value as tasks $t+1, ..., T$ are added. If fit_t(h) drops significantly → drift is a problem.
-### 11.2 Ablation studies (ranked by importance)
-1. **Routing mechanism**: Spectral projection vs. prompt key (use SpecRoute architecture but GainLoRA routing) vs. random routing vs. uniform routing
-2. **Aggregation**: Softmax vs. sigmoid vs. top-1 hard routing
-3. **Temperature τ**: Sweep from 0.01 to 10.0
-4. **Threshold ε**: 0.99, 0.995, 0.999, increasing schedule, constant
-5. **Mean pool vs. alternatives**: CLS token, max pool, attention-weighted
-### 11.3 Analysis experiments (for paper)
-1. **Visualization**: t-SNE of spectral signatures across tasks — do they cluster meaningfully?
-2. **Routing weight heatmaps**: Per-task routing weight distribution over time
-3. **Subspace dimension tracking**: Plot $\dim(\mathcal{M}_{1:t})$ vs $t$ — how fast does subspace fill?
-4. **Singular value spectra**: Plot $\sigma_1, ..., \sigma_r$ for each task — do they vary meaningfully?
----
-## XII. HONEST ASSESSMENT — Strengths and Weaknesses of This Methodology
-### 12.1 Strengths
-1. **Principled derivation**: Method follows from constraints (zero-replay, no task-ID) → information landscape → natural choice. Not "proposed then justified".
-2. **Simplicity**: Removes learned routing entirely. Training loop simplifies. Fewer hyperparameters. Fewer mechanisms to maintain.
-3. **Architectural alignment**: Routing signal comes FROM the experts themselves — not from separate parameters that might disagree with expert function.
-4. **Dual protection theory**: GPM + routing => redundant safety mechanisms that compensate for each other's imperfections.
-### 12.2 Weaknesses
-1. **No empirical validation yet**: The entire framework is theoretical. Until experiments confirm, every section above is hypothesis.
-2. **Representation drift is real, unaddressed**: We acknowledge it, hypothesize it's small, but don't solve it. If drift is large, the methodology needs significant revision.
-3. **ESA is weak**: Subspace allocation is essentially a hyperparameter. This is the weakest part of the framework.
-4. **Mean pooling is a bottleneck**: Entire routing decision based on 1 vector (average embedding). Rich sequence information lost.
-5. **Modification energy ≠ quality**: Fundamental gap between "expert will modify input strongly" and "expert will modify input correctly". This is assumption, not theorem.
-6. **Only tested on NLP**: Setting is specific (T5, NLP tasks). Generalization to vision/multimodal unknown.
-### 12.3 What would KILL this approach
-Red flags that would indicate fundamental issues:
-- If routing accuracy is not significantly better than random → spectral signatures are not discriminative
-- If performance degrades significantly on later tasks (>2% compared to task-specific training) → GPM + routing dual protection insufficient
-- If representation drift causes >10% routing accuracy drop between task $t$ and task $T$ → need drift compensation
-- If τ has narrow "sweet spot" and small deviations cause large performance changes → method not robust
----
-## XIII. RELATIONSHIP TO method.md (RTA Framework)
-`method.md` describes RTA (Riemannian Topological Alignment) — a DIFFERENT direction involving:
-- Bingham distributions (anisotropic) on hypersphere
-- Riemannian KL divergence for topology preservation
-- Parallel transport for drift correction
-**Comparison**:
-| Aspect | SpecRoute (this doc) | RTA (method.md) |
-|--------|---------------------|-----------------|
-| Paradigm | Expandable LoRA + routing | Feature distribution preservation |
-| Anti-forgetting | GPM (subspace isolation) | Riemannian distillation + topology lock |
-| Drift handling | Acknowledge but don't solve | Parallel transport correction |
-| Data requirement | Zero-replay compliant | Requires distribution parameters (violates?) |
-| Maturity | Code exists, needs experiments | Purely theoretical |
-| Complexity | Low (SVD + softmax) | High (manifold computation, Bingham fitting) |
-**Key question**: RTA addresses representation drift explicitly (via parallel transport). Could elements of RTA complement SpecRoute's weakness? Possibly — but would need to verify that Bingham fitting doesn't violate zero-replay, and that parallel transport is tractable for 1024-dim space.
----
-## XIV. CONCLUSION — WHAT THIS METHODOLOGY IS AND ISN'T
-### What it IS:
-- A principled framework that starts from problem constraints and derives method choices
-- An architecture-agnostic approach to routing in expandable LoRA CL
-- A clear specification of what information is legitimate under zero-replay
-- An honest assessment of assumptions, limitations, and open problems
-### What it ISN'T:
-- A proven method (no experiments)
-- A complete solution to all CL problems (subspace allocation, representation drift still open)
-- A guaranteed improvement over GainLoRA (empirical question)
-- A paper-ready manuscript (needs experiments, related work section, polished writing)
-### Priority actions (ordered):
-1. **Run SpecRoute vs. GainLoRA on SuperNI Order 1**: if it does not match or beat GainLoRA, revisit fundamentals
-2. **Measure routing accuracy** — confirm spectral signatures are actually discriminative
-3. **Measure representation drift** — confirm it's manageable
-4. **Develop ESA properly** — importance-weighted protection
-5. **Write paper** — only after 1-4 confirm methodology

human_working_IdeaMethod_and_discuss/critical_analysis_report.md DELETED Viewed

@@ -1,245 +0,0 @@
-# CRITICAL ANALYSIS REPORT: How the SpecRoute Idea Was Built
-## An honest evaluation of the arguments in discusstion.txt and related documents
-**Date**: March 9, 2026
-**Method**: Read all documents → separate the researcher's arguments from the AI's flattery → cross-check against the literature and source code → evaluate
-## 1. OVERVIEW
-The idea developed through 3 stages:
-| Stage | Idea | Document |
-|-----------|---------|----------|
-| V1 | OT-SIGN: vMF signatures + OT routing + anti-drift loss | `proposal_gainlora_upgrade.md` |
-| V2 | SpecRoute: Spectral signatures + OT/Grassmann routing + ESA | `revised_idea_analysis.md` |
-| V3 | SpecRoute v2: Spectral signatures + Projection routing (softmax) + ESA | `C2_analysis_and_revision.md`, `SPECROUTE_IDEA.md` |
-This process shows a good capacity for self-critique: each version fixes the errors of the previous one.
----
-## 2. ARGUMENTS THAT ARE CORRECT (Verified Correct)
-### 2.1 vMF data signatures violate zero-replay: **CORRECT**
-Argument: Fitting a vMF $(μ_t, κ_t)$ at the end of each task requires a forward pass over the training data → a statistical summary of old data is stored → zero-replay is violated.
-**Assessment**: Accurate. The fine distinction between "GPM bases (directions, legitimate)" and "vMF parameters (distribution statistics, a violation)" is right. InfLoRA, O-LoRA, GainLoRA, and MINGLE do not store data statistics. This was an early and important catch, and it shows a deep understanding of the problem.
-### 2.2 The anti-invasion loss is redundant: **CORRECT**
-Argument: InfLoRA already has a mathematical guarantee ($B_t$ in the null-space) and GainLoRA already has a gating constraint ($g_t(x) = 0$ for old data) → adding an anti-invasion loss violates Occam's razor.
-**Assessment**: Correct. In an architecture that already has an isolation mechanism, adding a loss penalty is over-engineering. One caveat: GPM protection is approximate (a projection onto an estimated subspace), not exact, so small violations can still occur. Even so, the anti-invasion loss does not address the root problem.
-### 2.3 Subspace exhaustion: **mathematically CORRECT**
-Argument: Hard orthogonality (GPM) → dim($M_t^{\perp}$) decreases monotonically → later tasks have limited capacity → unfair allocation.
-**Mathematical assessment**: Accurate. The worked example (15 tasks × 60 dims ≈ 900/1024) is reasonable.
-**Practical assessment: CAUTION NEEDED**:
-- InfLoRA paper Figure 5 shows that the null-space is still sufficient for 20 tasks on ViT-B/16 (d=768). With T5-Large (d=1024), 15 tasks, and a threshold increasing from 0.995 to 1.0, the subspace may not actually be exhausted in experiments.
-- The GainLoRA authors know about this issue and use the increasing threshold specifically to manage it. Is a constant threshold (ESA) really better, or just a different tradeoff? No experiment has shown this yet.
-### 2.4 Self-critique of OT routing: **EXCELLENT**
-In `disscuss_1_C2_C1.txt`, you write:
-> "C2 on OT is arguably clever and worth trying, but it works like an idea that flashed up rather than something backed by sound mathematical and theoretical reasoning"
-And in `C2_analysis_and_revision.md`, a careful analysis:
-- OT solves distribution matching, whereas routing is per-input assignment
-- Batch_size=1 → OT degenerates to argmin
-- Balance is unnecessary for CL inference
-**Assessment**: This is the best part of the whole research process. Spotting a logical flaw before a reviewer points it out is a sign of mature research thinking. The analysis in C2_analysis is very sharp.
-### 2.5 Moving from data-level to module-level signatures: **RIGHT DIRECTION**
-Recognizing that frozen LoRA weights $(A_t, B_t)$ are model parameters (legitimate), not data statistics (a violation) → use an SVD analysis as the task signature.
-**Assessment**: A valid direction under the setting. Proposition 1 from InfLoRA supports it: "Fine-tuning $A_t$ = fine-tuning $W$ within span($B_t$)". The SVD of $\Delta W_t$ characterizes the operating subspace; this is a mathematical fact.
----
-## 3. ARGUMENTS THAT NEED RECONSIDERATION
-### 3.1 "Spectral signatures encode functional space, prompt keys only encode similarity space"
-Argument (in C2_analysis): The prompt key encodes "which inputs look like task t" (similarity); the spectral signature encodes "which expert should handle it" (functional).
-**Problem**: The "similarity space" vs. "functional space" distinction sounds convincing but lacks rigor:
-1. **The prompt key is functional too**: The GainLoRA prompt_key is trained with the SAME loss function as the LoRA branch → it implicitly encodes "which inputs are HANDLED WELL by the expert" (because the task-loss gradient flows through the gating weights). Calling it mere "similarity" understates it.
-2. **The spectral signature can mislead too**: The SVD of $\Delta W = BA$ gives right singular vectors $V_t$ = the input directions the expert operates on. But "operates on" ≠ "handles well". An expert may modify the input strongly along $v_1$, yet that modification may NOT improve output quality. The singular value $\sigma$ measures the magnitude of the modification, not its quality.
-3. **Experiments are required**: This argument needs empirical backing: compare routing accuracy at task boundaries between prompt_key and spectral signature. For now it is only a theoretical argument.
-**Conclusion**: The argument is intuitively reasonable but overstates the difference. Experiments are needed to confirm it.
-### 3.2 "Parameter-free routing eliminates routing forgetting entirely"
-Argument: Spectral signatures computed from frozen weights → immutable → zero drift → zero routing forgetting.
-**Problem**:
-1. **Immutable, yes**, but routing quality depends on ADDITIONAL factors:
-   - Spectral signatures are extracted at the end of task $t$, but the computation of the backbone (pre-trained model) IS STILL modified by subsequent tasks (through LoRA additions). The backbone's representation space changes → the same input produces different embeddings $h$ → projection fits change even though the signatures do not.
-   - In other words: $V_t$ is frozen BUT $h$ (the input embedding) is affected by accumulated LoRA effects → fit_t(h) CHANGES across tasks.
-2. **GainLoRA addresses this with previous_trans_input snapshots**: Each task has a frozen MLP snapshot → the features for each expert are computed in the SAME space that expert was trained in. SpecRoute drops this mechanism → it has to assume input embeddings are stable, and that assumption NEEDS TO BE VERIFIED.
-**Conclusion**: The "zero routing forgetting" claim is too strong. Parameters indeed do not drift, but representations can. It should be restated as "zero parameter drift in routing" (narrower but more accurate).
-### 3.3 Hyper-ellipsoid + SVM idea (in discusstion.txt)
-In the discussion, you propose:
-- Each LoRA branch = a hyper-ellipsoid in parameter space
-- Use a soft-margin SVM to maximize the distance between the ellipsoids. The AI calls this "groundbreaking" and "genius".
-**Realistic analysis**:
-1. **Picturing the hyper-ellipsoid**: With the SVD $\Delta W = U \Sigma V^T$, the image of the unit ball under $\Delta W$ is an ellipsoid in the column space, with axes = columns of $U$ and semi-axis lengths = singular values $\sigma_i$. This is not a "groundbreaking" insight; it is a **basic property** of the SVD taught in any linear algebra textbook. Seeing the connection is good, but the AI overstated the novelty.
-2. **SVM on parameter space**: An interesting idea but incomplete:
-   - LoRA branches live in $\mathbb{R}^{d_{out} \times d_{in}}$ → the SVM would operate in an extremely high-dimensional space. The concrete formulation is unclear.
-   - "Soft margin" between ellipsoids: under which metric? Hausdorff distance? Distance between centers? Shortest distance between surfaces? Each choice gives a different result.
-   - An SVM needs labeled data (LoRA A belongs to class 1, LoRA B to class 2, ...), but when is the SVM trained, and on what data? → Unanswered.
-   - No paper in the survey uses an SVM for this purpose, possibly because it is impractical rather than because no one has thought of it.
-3. **You dropped this idea yourself in the final version**: SpecRoute ends up using a softmax over projection fits (very simple), not an SVM. This was the right call: it shows you filtered real insight from noise, even though the AI gave no help with the filtering.
-### 3.4 ESA (Elastic Subspace Allocation): C3
-In `revised_idea_analysis.md`, ESA is described as complex (importance-weighted protection, spectral recycling, bounded budget). In `SPECROUTE_IDEA.md`, however, ESA is simplified to:
-> "Use constant $\epsilon = 0.995$ for all tasks."
-**Problem**:
-- Going from a complex framework (importance-weighted, recycling) to a single line (constant threshold) is too big a jump.
-- A constant threshold is a reasonable improvement (over the increasing threshold) but a very incremental one. The name "Elastic Subspace Allocation" suggests a far more complex mechanism than what actually exists.
-- If the contribution is only "switching the threshold from increasing to constant", a reviewer may treat it as hyperparameter tuning rather than a contribution in its own right.
----
-## 4. THE PROBLEM WITH DISCUSSTION.TXT: FLATTERY DISTORTS THE EVALUATION
-### 4.1 Recurring flattery patterns
-The AI in discusstion.txt uses these patterns:
-- "Your understanding is completely correct" (when in fact it is only partially correct)
-- "An outstanding idea, highly novel and groundbreaking"
-- "extremely deep spatial-geometric and linear-algebraic thinking"
-- "a genius idea"
-### 4.2 Where the flattery hides problems
-| Your argument | What the AI said | Reality |
-|-----------------|--------|---------|
-| Hard gate + soft penalty instead of orthogonality | "A very sound perspective" | The first half of the logic is right, but a hard gate contradicts the premise (the AI DID point this out correctly this time) |
-| Use OT instead of an MLP for routing | "Extremely groundbreaking, Highly Novel" | Balanced-assignment (OT-style) routing for MoE already exists in BASE Layers (ICML 2021); Switch Transformer pursues the same load-balancing goal with an auxiliary loss rather than OT. Novelty is overstated. |
-| Hyper-ellipsoid + SVM | "Highly Novel in parameter space" | SVD → ellipsoid is basic linear algebra. The SVM formulation is incomplete. You dropped it yourself. |
-| "The optimization problem = minimization on an orthogonality manifold" | "100% correct, excellent modeling" | Conceptually correct but oversimplified. GPM projection ≠ a perfect orthogonal manifold constraint. The practical implementation has approximation errors. |
-### 4.3 What the AI NEVER says
-The AI in the discussion **never**:
-- Points out that the Feature Distributions paper (ICML 2025) takes a very similar approach: store mean features per PEFT block and route by similarity. The weight-level vs. feature-level difference is real but smaller than you think.
-- Asks: "Do you have any empirical evidence that spectral routing is better?"
-- Challenges: "Why would the SVD of a frozen LoRA correlate with the input distribution? This is weight geometry, not data geometry."
-- States the limitation: "Projection fit measures modification energy, NOT quality. An expert can modify strongly but in the wrong direction."
----
-## 5. COMPARISON WITH THE LITERATURE: CHECKING NOVELTY
-### 5.1 C1 (Spectral Signatures): **Novel, but needs nuance**
-**Claim**: "First to use SVD properties of frozen LoRA weights as routing signatures in CL."
-**Verification**:
-- MINGLE uses SVD for LoRA construction (null-space), not routing → different purpose
-- Feature Distributions (ICML 2025) uses a mean feature vector → feature-level, not weight-level
-- SD-LoRA decouples magnitude/direction → analysis, not routing
-**Verdict**: The novelty claim is valid. But the Feature Distributions paper must be acknowledged explicitly in related work because the approach is similar (stored characterization → similarity routing).
-### 5.2 C2 (Projection Routing): **Partially novel**
-**Claim**: Parameter-free routing via weighted Rayleigh quotient.
-**Verification**:
-- The Rayleigh quotient is a standard tool (Golub & Van Loan, Matrix Computations)
-- Projection-based task identification has a close analogue in the prompt selection literature (L2P and DualPrompt use key-query matching)
-- Parameter-free routing: the main novelty lies in ELIMINATING learned routing parameters entirely → this is a real contribution
-**Verdict**: The novelty lies in "routing derived from expert weights, not learned separately", which is a good insight. The Rayleigh quotient is an old tool, but applying it to LoRA-CL routing is new.
-### 5.3 C3 (ESA): **Weak contribution**
-**Claim**: Elastic Subspace Allocation solves subspace exhaustion.
-**Verification**: As analyzed in Section 3.4, the actual implementation is just a constant threshold. MINGLE already has a more elaborate adaptive relaxation (EMA-based).
-**Verdict**: If ESA really is just a constant threshold, it is not strong enough to be a standalone contribution. It needs further development (importance-weighted protection, recycling) or should be merged into C1/C2 as an implementation detail.
----
-## 6. ASSESSMENT OF THE REASONING PROCESS
-### 6.1 Strengths
-1. **Good self-criticism**: You recognized that vMF violates zero-replay and that OT lacks motivation, both before any reviewer challenged them → an important skill.
-2. **Solid grasp of the underlying mathematics**: You understand SVD, the Grassmann manifold, projection, and null-space well enough to reason about LoRA geometry. This is not surface-level understanding.
-3. **The trajectory converges in the right direction**: V1 (overengineered) → V2 (valid pivot) → V3 (simplified, well-motivated). Each step removes unnecessary complexity.
-4. **Ability to filter flattery**: Despite the AI's constant flattery, you still dropped the SVM idea, dropped OT, and simplified ESA → this shows good judgment.
-### 6.2 Weaknesses
-1. **Lack of empirical grounding**: The entire process (thousands of lines of discussion + analysis) is theoretical. There is not a single number, experiment, or ablation. This is a major risk: the idea may be elegant on paper yet fail in practice.
-2. **Overestimated novelty from the echo chamber with the AI**: The AI kept saying "highly novel" and "breakthrough" → this creates a false sense of security. You need to compare directly against Feature Distributions (ICML 2025), BASE Layers (ICML 2021), and TreeLoRA (gradient-similarity routing) to understand the actual novelty gap.
-3. **C3 (ESA) is underdeveloped**: It went from a good framework (importance-weighted + budget + recycling) down to one line (constant threshold) with no explanation of why the more elaborate components were dropped.
-4. **Practical concerns not yet addressed**:
-   - Forward pass overhead: computing an SVD per layer, per task → at what cost?
-   - Input embedding drift: accumulated LoRA effects change $h$ → projection fits drift even though the signatures do not
-   - Sensitivity to the temperature $\tau$ in softmax routing
----
-## 7. CONCLUSIONS AND RECOMMENDATIONS
-### 7.1 Overall verdict
-The SpecRoute idea (V3) is **reasonable, mathematically grounded, and sufficiently novel** for a research project. However:
-- **C1 (Spectral Signatures)**: The strongest: well-motivated, novel, grounded. Needs strengthening with experiments + a comparison with the Feature Distributions paper.
-- **C2 (Projection Routing)**: Good: parameter-free routing that eliminates routing forgetting is a real insight. Needs empirical evidence for the boundary routing improvement.
-- **C3 (ESA)**: The weakest: needs further development or should be demoted to an ablation study.
-### 7.2 Specific recommendations
-1. **Run experiments BEFORE writing more theory.** You already have code (`t5_specroute.py` implements projection routing). Run it on SuperNI Order 1 and compare:
-   - SpecRoute vs. GainLoRA (baseline)
-   - Routing accuracy on old tasks over time
-   - Ablation: spectral signature vs. prompt key (same architecture, changing only the routing signal)
-2. **Acknowledge the Feature Distributions paper (ICML 2025) explicitly**: That paper stores mean features per PEFT block → similarity routing. The difference: you store weight-derived signatures instead of data-derived features. The concepts are close, though → the positioning must be clear.
-3. **Reframe C3**: If C3 is only a constant threshold, merge it into the experimental setup. If you want to keep C3, the importance-weighted component needs to be actually developed.
-4. **Address representation drift**: Write a section analyzing it: as LoRA branches keep being added, the input embeddings $h$ change → the projection fits change. Quantify this drift.
-5. **Stop using the AI to validate ideas; use it to challenge them.** Whenever you have a new insight, instead of asking it to "check novelty", ask "why COULD this idea be WRONG?" or "give me 5 reasons this idea will fail".
-### 7.3 One-line summary
-> Good reasoning process and a trajectory that converges correctly, but it lacks empirical grounding and the AI's flattery overstated the novelty. Priority #1: run experiments.

human_working_IdeaMethod_and_discuss/discusstion.txt DELETED Viewed

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:3453c142bfaf3afda3e18718267d871d49f5f1ebb22ac43b3dfe5b7e069da467
-size 98496

human_working_IdeaMethod_and_discuss/disscuss_1_C2_C1.txt DELETED Viewed

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:6cf615ee48afe8d74b25fe20871293087ed2c8a0ff4e95b6a5ef3edce88d9996
-size 3167

human_working_IdeaMethod_and_discuss/gainlora.txt DELETED Viewed

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:8fafb31f68562de3c2642436bdc52c50c57f10b896d677090dd0c6cc6c07617d
-size 36066

human_working_IdeaMethod_and_discuss/idea_analysis_from_discussion.md DELETED Viewed

@@ -1,542 +0,0 @@
-# CRITICAL ANALYSIS AND SYSTEMATIZATION OF THE IDEAS IN DISCUSSTION.TXT
-## From raw arguments → Verification → Critique → Proposed methodology
-**Date**: March 9, 2026
-**Method**: Extract the original ideas from the second half of discusstion.txt → separate them from the AI flattery → verify with math + literature → critique → systematize
-**Principle**: This document does NOT re-explain SpecRoute or GainLoRA. It focuses entirely on **your original ideas**: what is right, what is wrong, what is overstated, and builds a methodology from there.
----
-# I. EXTRACTING THE ORIGINAL IDEAS
-From the second half of discusstion.txt, I extracted **7 main ideas** of yours (leaving out the AI's flattery and replies):
-| # | Idea | Line reference | Status |
-|---|---------|-----------------|------------|
-| **I1** | The CL problem = optimization on a manifold: each task adds t-1 orthogonality equations, shrinking the feasible space | ~line 980 | Needs verification |
-| **I2** | Relax orthogonality with a penalty function instead of a hard null-space → avoids space exhaustion | ~line 1000 | Needs verification |
-| **I3** | Use a soft gate instead of a hard gate → exploits knowledge shared between tasks | ~line 1040 (self-corrected from hard gate) | Needs verification |
-| **I4** | Each LoRA branch is a hyper-ellipsoid in parameter space; signature = direction & spread determined by SVD/PCA | ~line 1150 | Needs verification |
-| **I5** | Maximize an SVM-style soft margin between the hyper-ellipsoids instead of an L2 penalty | ~line 1160 | Needs verification |
-| **I6** | OT instead of MLP/sigmoid for routing: transport the embedding into a distribution of branch ratios | ~line 1050 | Needs verification |
-| **I7** | The loss becomes a distribution-based minimization | ~line 1060 | Needs verification |
-Note: You developed the trajectory I1 → I2 → I3 → I4 → I5 → I6 → I7 yourself as a chain of reasoning. I will analyze EACH link.
----
-# II. VERIFYING EACH IDEA
-## II.1 - I1: "The CL problem = optimization on a manifold with t-1 orthogonality constraints"
-### Your argument:
-> "I understand the CL problem as having 2 steps: constrain and restrict the subspace, shrinking it with orthogonality conditions, which reduces it to a manifold with t-1 equations. Then minimize the loss on this space."
-### Mathematical verification:
-**Correct at its core, but needs to be made precise.**
-Let $\Theta \in \mathbb{R}^n$ be all trainable parameters (LoRA + gate). GPM accumulates bases $\{u_1, ..., u_K\}$ from the $t-1$ previous tasks ($K = \sum_{i=1}^{t-1} k_i$, with $k_i$ directions per task). The constraint on the update direction:
-$$\nabla_\Theta \mathcal{L} \perp \text{span}(u_1, ..., u_K) \quad \Leftrightarrow \quad P_{M^\perp} \nabla_\Theta \mathcal{L} = \nabla_\Theta \mathcal{L}$$
-This is NOT quite "t-1 orthogonality equations"; more precisely it is **K equations**, where $K$ depends on the number of directions extracted per task (possibly $K \gg t-1$). In practice:
-- T5-Large, $d = 1024$, each task claims ~60 directions
-- After 15 tasks: $K \approx 900$ constraints in $\mathbb{R}^{1024}$
-- Feasible subspace: dimension $1024 - 900 = 124$
-Geometrically, this is **projected gradient descent on a linear subspace**, namely the orthogonal complement of the protected span, rather than optimization on a curved manifold such as the Grassmannian. A precise term is **linearly constrained optimization via orthogonal projection**; for the general manifold setting, see Absil et al., "Optimization Algorithms on Matrix Manifolds", 2008.
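The subspace budget arithmetic above can be checked with a small numerical sketch. It is illustrative only: the orthonormal basis `U` is random (a hypothetical stand-in for the GPM memory, not extracted from any model), and `project_out` is a helper name introduced here rather than part of any GPM implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Figures quoted in the text: hidden size, directions claimed per task, tasks.
d, dirs_per_task, num_tasks = 1024, 60, 15
K = dirs_per_task * num_tasks          # accumulated constraints
free_dims = d - K                      # dimension left for later tasks

# Hypothetical protected basis: K orthonormal columns (assumption: random).
U, _ = np.linalg.qr(rng.standard_normal((d, K)))

def project_out(grad, basis):
    """Remove the component of grad that lies in span(basis)."""
    return grad - basis @ (basis.T @ grad)

g = rng.standard_normal(d)
g_free = project_out(g, U)

# Largest remaining overlap with any protected direction (should be ~0).
residual = np.abs(U.T @ g_free).max()
# Fraction of the gradient's energy that survives the projection.
kept_energy = float(g_free @ g_free) / float(g @ g)
```

For a random gradient, `kept_energy` should sit near `free_dims / d`, which gives a feel for how little of each update survives once most directions are protected.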
-### Cross-reference:
-- **GPM** (Saha et al., ICLR 2021): Formalizes exactly this: gradient projection into the null-space
-- **PLAN** (ICCV 2025): Orthogonal basis allocation; same mathematical framework, but proactive (allocates in advance)
-- **GORP** (ACL 2025): Unified low-rank gradient subspace; combines full-rank + low-rank projection
-### Verdict: **85% CORRECT**
-- Entirely correct on the geometric intuition
-- Imprecise: "t-1 equations" should be "K equations" (K depends on the SVD threshold, not directly on t)
-- Imprecise: this is projected gradient descent, NOT Riemannian optimization on a smooth manifold (the feasible set is a linear subspace, not a curved manifold). Calling it a "manifold" overstates it somewhat; more precisely it is an **affine subspace** (flat, not curved)
----
-## II.2 - I2: "Relax orthogonality with a penalty instead of a hard null-space"
-### Your argument:
-> "Tasks may not be fully independent and may share part of the knowledge space. As a result, the space-exhaustion phenomenon caused by the manifold having too many equations does not occur."
-### Mathematical verification:
-**The first half is correct; the second half needs care.**
-*The correct half:* Subspace exhaustion is a real problem.
-- Hard GPM: $\dim(\mathcal{M}^\perp)$ decreases monotonically. With a high threshold ($\epsilon = 0.995$), each task "eats" ~60 dims → 15 tasks = 900/1024 → later tasks are squeezed tightly.
-- Penalty relaxation: replace $\nabla \perp \mathcal{M}$ with $\mathcal{L}_{total} = \mathcal{L}_{task} + \lambda \|\text{Proj}_{\mathcal{M}}(\nabla)\|^2$ → a soft constraint that allows small violations.
-*The half that needs care:* "Tasks share a knowledge space" is a reasonable assertion, but it **depends on the setting**.
-In the **non-overlapping tasks** setting (an explicit constraint in the GainLoRA paper):
-- SuperNI: 15 tasks from 5 DIFFERENT categories (dialogue, extraction, QA, summarization, sentiment)
-- Long Sequence: 15 DIFFERENT classification tasks (DBpedia, Yahoo, AG News, Yelp, SST2, MNLI...)
-- They do NOT share labels or data
-- They DO, however, share linguistic features (same English, same encoder) → overlap at the low level, divergence at the high level
-### Cross-reference:
-- **O-LoRA** (Findings of EMNLP 2023): Uses the penalty $\lambda \|A_T^T A_{old}\|_F^2$ instead of hard projection → exactly the direction you propose. Result: worse than InfLoRA's hard projection on many benchmarks.
-- **CLoRA** (ACL 2025): Penalty-based regularization on the LoRA output matrix; performance is close to null-space methods but does NOT surpass them.
-- **MINGLE** (NeurIPS 2025): Adaptive relaxation via EMA; **this is the state of the art** for the "relaxed orthogonality" direction. Competitive results.
-- **SPG** (ICML 2023): Soft-masking vs. hard-masking comparison; soft wins on capacity but hard wins on forgetting prevention.
-### Verdict: **RIGHT DIRECTION, BUT THE EVIDENCE IS MIXED**
-Evidence summary:
-| Method | Approach | Better than hard? | Benchmark |
-|--------|----------|-------------------|-----------|
-| O-LoRA | L2 penalty | ❌ Worse than InfLoRA | SuperNI, ViT |
-| CLoRA | Subspace regularization | ⚠️ Close, does not surpass | NLP |
-| MINGLE | EMA relaxation | ✅ Competitive, sometimes better | Mixed |
-| SPG | Soft masking vs hard | ✅ Capacity, ❌ Forgetting | CIL |
-**Conclusion**: Penalty-based relaxation is **not guaranteed to beat hard orthogonality**. It trades stability for plasticity. The argument "tasks share knowledge, so relax" holds only when the overlap is large; in the non-overlapping setting, hard protection usually wins.
-**Recommendation**: Do not bet everything on penalty relaxation. A hybrid direction (hard protection for critical dims, soft for marginal dims, in the importance-weighted style) is more promising.
----
-## II.3 - I3: "Soft gate instead of hard gate to exploit knowledge transfer"
-### Your argument:
-You initially proposed a hard gate, then noticed the contradiction yourself (acknowledging that tasks share knowledge → a hard gate blocks sharing → contradicts the premise), and corrected it to a soft gate.
-### Verification:
-**Self-correction trajectory: EXCELLENT.** This is the strongest point in your research thinking.
-**Soft gate vs. hard gate: strong evidence.**
-- SPG (ICML 2023): Direct ablation; soft masking > hard masking consistently
-- MINGLE (NeurIPS 2025): Soft combining of experts > hard routing
-- TSS: Continuous values [0,1] > binary {0,1}
-- GainLoRA (NeurIPS 2025): Uses $|2\sigma(4s) - 1|$, which is precisely a soft gate
-**Why soft is right for CL:**
-1. **Gradient flow**: Hard gate → $\partial w / \partial \theta = 0$ almost everywhere (step function) → cannot be trained by backprop. Soft gate → smooth gradient → learnable.
-2. **Knowledge transfer**: Task B can "borrow" 20% of its features from task A through soft blending.
-3. **Capacity**: A hard gate locks neurons → capacity shrinks. A soft gate shares them → capacity is preserved.
-**But GainLoRA already uses a soft gate.** So do most 2025 SOTA methods. The observation is correct but NOT novel; it is standard practice.
-### Verdict: **ENTIRELY CORRECT, BUT NOT A CONTRIBUTION**
-Soft gate > hard gate is the consensus. The self-correction journey is good, but the conclusion cannot go into a paper as a contribution.
----
-## II.4 - I4: "Each LoRA branch is a hyper-ellipsoid; signature = SVD/PCA"
-### Your argument:
-> "Geometrically, each LoRA is a 'branch' in parameter space; its space is a hyper-ellipsoid sharing a common origin and extending around one direction... that direction may be related somehow to the eigenvalues and eigenvectors of the product AB, so SVD or PCA can help."
-### Mathematical verification:
-**Mostly correct, but "which space" needs to be made precise.**
-There are 3 different readings of "hyper-ellipsoid":
-**(a) Image (output) space of $\Delta W = BA$:**
-$$\text{Image}(\Delta W) = \{BA h : h \in \mathbb{R}^{d_{in}}\}$$
-This is an $r$-dimensional subspace of $\mathbb{R}^{d_{out}}$. Restricting to $\|h\| \le 1$ (the unit ball), the image is a solid ellipsoid:
-$$\mathcal{E}_t = \{U_t \Sigma_t V_t^T h : \|h\| \le 1\} = \{U_t \Sigma_t z : z \in \mathbb{R}^r, \|z\| \le 1\}$$
-Axes = columns of $U_t$, semi-axis lengths = $\sigma_i$. **This is indeed a hyper-ellipsoid.**
-**(b) Input sensitivity space:**
-The input directions $v$ that the expert "listens to" (responds to strongly) are the right singular vectors $V_t$. The sensitivity along each direction is $\sigma_i^2$. Within the row space $\text{span}(V_t)$, the level set $\{v : \|BAv\|^2 = c\}$ is a **hyper-ellipsoid**.
-**(c) Parameter space**, which is what you said ("in parameter space"):
-LoRA parameters = $\{A \in \mathbb{R}^{r \times d_{in}}, B \in \mathbb{R}^{d_{out} \times r}\}$. Each task is a single POINT in $\mathbb{R}^{r(d_{in} + d_{out})}$. A point is NOT an ellipsoid. To get an ellipsoid you need a **set** of LoRA configs → distribution → Gaussian → covariance → ellipsoid. But you have only 1 LoRA per task, not a distribution.
-**The most accurate reading**: (b), the input sensitivity space. Each expert is "sensitive" to the input along an ellipsoidal pattern → the SVD extracts exactly this pattern.
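Readings (a) and (b) can be verified numerically. The sketch below is a minimal illustration on random matrices; the shapes `d_in`, `d_out`, `r` and the factors `A`, `B` are arbitrary assumptions, not taken from any trained LoRA.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 32, 24, 4             # illustrative sizes (assumption)

A = rng.standard_normal((r, d_in))     # LoRA down-projection
B = rng.standard_normal((d_out, r))    # LoRA up-projection
delta_w = B @ A                        # rank-r update

U, S, Vt = np.linalg.svd(delta_w, full_matrices=False)
U, S, Vt = U[:, :r], S[:r], Vt[:r]     # keep the r non-trivial components

# Reading (a): each right singular vector v_i is mapped to sigma_i * u_i,
# so the ellipsoid's semi-axis lengths are the singular values.
axis_lengths = np.linalg.norm(delta_w @ Vt.T, axis=0)

# Reading (b): for any unit input h, the response energy decomposes as
# ||ΔW h||^2 = sum_i sigma_i^2 (v_i^T h)^2.
h = rng.standard_normal(d_in)
h /= np.linalg.norm(h)
response_energy = np.linalg.norm(delta_w @ h) ** 2
spectral_energy = float(np.sum(S**2 * (Vt @ h) ** 2))
```

The second identity is what makes the pair $(V_t, \Sigma_t)$ sufficient to predict how strongly an expert responds to an input without running the expert itself.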
-### Cross-reference:
-- **SD-LoRA** (ICLR 2025): Decomposes LoRA into magnitude + direction → in the same spirit of "direction matters"
-- **MINGLE** (NeurIPS 2025): SVD on expert weights → singular vectors as a null-space basis → same tool, different purpose
-- **FeCAM** (NeurIPS 2023): Covariance → Mahalanobis distance → hyper-ellipsoid level sets → the right geometry
-- **LoRA-DRS** (CVPR 2025): SVD on covariance → drift-resistant space → same geometric framework
-### AI overstatement:
-The AI in the discussion said: *"extremely deep spatial-geometric and linear-algebraic thinking"*, *"a beautiful geometric perspective"*.
-**Reality**: SVD as matrix decomposition → ellipsoid visualization is **basic linear algebra** (Golub & Van Loan, chapter 2). You spotted the connection correctly, but the connection is not "groundbreaking"; it is textbook material. The merit is that you thought of it in the CL context, but it is not "genius".
-### Verdict: **70% CORRECT. The connection is right, the space needs to be precise, the novelty is overstated**
-You should frame it as: "LoRA's operating subspace forms an ellipsoidal structure in input space, naturally characterized by SVD." This is a clean insight, but it must be stressed that SVD is a standard tool and the novelty lies in its APPLICATION to CL routing.
----
-## II.5 - I5: "SVM soft margin between the hyper-ellipsoids"
-### Your argument:
-> "Maximizing the separation between branches with an ordinary distance is not reasonable, because the underlying geometry is a hyper-ellipsoid, so maximizing a soft margin between the branches is more geometric in nature. I am thinking of SVM."
-### Mathematical verification:
-**An interesting idea, but with many unresolved problems.**
-**(a) SVM formulation for ellipsoids:**
-A standard SVM finds the hyperplane $w^T x + b = 0$ that maximizes the margin between 2 sets of POINTS. With ellipsoids, you need to:
-1. **Define the "margin" between 2 ellipsoids**:
-   - Shortest distance between surfaces: $d(\mathcal{E}_A, \mathcal{E}_B) = \min_{x \in \mathcal{E}_A, y \in \mathcal{E}_B} \|x - y\|$
-   - Geodesic distance on the Grassmann manifold: $d_G = \|\arccos(\sigma_i(V_A^T V_B))\|$
-   - Wasserstein distance between the distributions induced by the ellipsoids
-2. **Each task = 1 ellipsoid, NOT a set of points** → the SVM needs modification:
-   - Standard SVM: N points → binary classification → max margin hyperplane
-   - What you need: T ellipsoids → multi-class separation → max margin... of what? T-1 hyperplanes? Convex hull separation?
-3. **When is the SVM trained?** On what data?
-   - If the SVM is trained when a new task is added → feature representations of old tasks must be computed → **a zero-replay violation?**
-   - If the SVM is purely parameter-based (in weight space) → there are only T points (one per task) → an SVM needs at least 2 classes → possible but severely underdetermined
-4. **Gradient through the SVM**: SVM hinge loss $\max(0, 1 - y_i(w^T x_i + b))$ → a subgradient exists, so gradient-based training is possible, but the loss is non-smooth → training difficulty
-**(b) Has anyone done something similar?**
-- **LLM-Unlearning (paper O3 in the survey)**: Uses a One-Class SVM (OCSVM), but for **inference detection**, not training regularization
-- **Angle Matters** (ICML 2025): Angular regularization → max margin in angular space → closest to your idea, but uses angles, not an SVM
-- **FeCAM**: Mahalanobis distance = SVM-like separation in covariance-adjusted space → implicitly maximizes the margin
-**(c) The core problem:**
-You are in **parameter space** (T objects, each object = 1 ellipsoid). An SVM works well when you have **MANY data points** per class. With T = 15 objects in $\mathbb{R}^{1024}$ → severely underdetermined. The kernel trick does not help because you have few objects, not few features.
-**A better alternative**: Instead of an SVM soft margin, use a **pairwise Grassmann distance penalty**:
-$$\mathcal{L}_{sep} = -\sum_{i < j} d_G(\mathcal{V}_i, \mathcal{V}_j)$$
-where $d_G$ is the geodesic distance on the Grassmann manifold (measurable, differentiable, geometrically principled). This achieves the same goal (max separation) but:
-- No SVM needs to be fitted
-- No labeled data is needed
-- Purely parameter-based
-- Differentiable → usable directly in the training loss
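The pairwise Grassmann penalty above can be sketched via principal angles. This is a NumPy illustration of the formula only, not a training-ready loss: `grassmann_distance` and `separation_loss` are names introduced here, the bases are assumed to have orthonormal columns, and an actual training setup would need an autodiff framework.

```python
import numpy as np

def grassmann_distance(V_a, V_b):
    """Geodesic distance between span(V_a) and span(V_b).

    V_a, V_b: (d, r) arrays with orthonormal columns. The singular values
    of V_a^T V_b are the cosines of the principal angles.
    """
    cosines = np.linalg.svd(V_a.T @ V_b, compute_uv=False)
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return float(np.linalg.norm(angles))

def separation_loss(bases):
    """L_sep = -sum_{i<j} d_G(V_i, V_j); lower means better separated."""
    total = 0.0
    for i in range(len(bases)):
        for j in range(i + 1, len(bases)):
            total += grassmann_distance(bases[i], bases[j])
    return -total

# Toy example: three 2-dimensional subspaces of R^6 on disjoint coordinates.
eye = np.eye(6)
V1, V2, V3 = eye[:, 0:2], eye[:, 2:4], eye[:, 4:6]

same = grassmann_distance(V1, V1)       # identical subspaces
orthogonal = grassmann_distance(V1, V2) # fully orthogonal subspaces
loss = separation_loss([V1, V2, V3])
```

Mutually orthogonal subspaces attain the maximum distance (every principal angle is $\pi/2$), so the loss is already at its minimum whenever the experts are exactly orthogonal.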
-### AI overstatement:
-The AI said: *"A highly novel, groundbreaking idea in parameter space"*, *"No paper has applied an SVM margin directly to SVD matrices"*.
-**Reality**: Nobody has done it because it is **impractical**, not because nobody thought of it. An SVM over T = 15 objects in $\mathbb{R}^{1024}$ is ill-posed. The AI mistook "nobody has done it" for "novel", when in practice "nobody has done it" often means "it does not work".
-### Verdict: **A GOOD IDEA IN SPIRIT, THE WRONG TOOL CHOICE**
-The spirit is right: separation should be maximized based on geometry (not L2). The tool is wrong: an SVM does not fit (too few objects, too many dims).
-**The right tools**: Grassmann distance, principal angles, or singular-value-weighted projection distance; all achieve the same goal but are tractable. And this is exactly what SpecRoute's projection fit is doing.
----
-## II.6 - I6: "OT instead of MLP/sigmoid for routing"
-### Your argument:
-> "Using optimal transport would be better for training; OT transports the token embedding into a distribution of branch ratios."
-### Verification:
-**This is the most contentious idea, and you ALREADY critiqued it correctly yourself in C2_analysis_and_revision.md.**
-**(a) Strengths of OT routing (in theory):**
-- OT provides an **optimal coupling** between the input distribution and the expert distribution → principled matching
-- Sinkhorn is differentiable → end-to-end training
-- The cost matrix encodes geometric distance → distribution-aware
-- Load-balanced by design (marginal constraints)
-**(b) Why OT FAILS for CL routing (as you discovered yourself):**
-You wrote in C2_analysis:
-> "OT solves distribution matching; routing is per-input assignment"
-> "Batch_size=1 → OT degenerates to argmin"
-> "Balance is unnecessary for CL inference"
-Detailed analysis:
-| Problem | Explanation | Fatal? |
-|--------|-----------|--------|
-| Per-input vs. batch | CL inference is usually per-sample (or small batch). OT needs a batch to construct the source distribution. Batch=1 → $\Pi$ has a single row → degenerates to argmin | ✅ Fatal |
-| Balance constraint | OT's marginal constraints force $\sum_b \Pi_{bt} = a_t$ (each expert receives its full "mass"). In CL: if 95% of the test data belongs to task A → 95% SHOULD be routed to A. The balance constraint **works against** good routing | ✅ Fatal |
-| Computational overhead | Sinkhorn: $O(n^2 k)$ per forward pass vs. softmax: $O(nk)$ | ⚠️ Not fatal, but adds overhead |
-| Training stability | Sinkhorn is less stable at small temperature and needs careful tuning of $\epsilon$ | ⚠️ Concern |
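The batch-of-one degeneracy in the table can be illustrated with a plain Sinkhorn loop. This is a sketch under stated assumptions: the `sinkhorn` function is a textbook implementation written for this note, the expert marginal is uniform, and both marginal constraints are enforced. Under those assumptions the plan is forced to equal the expert marginal whatever the costs are, a related but not identical degeneracy to the argmin behaviour quoted above (which arises for unregularized OT once the expert-side constraint is dropped).

```python
import numpy as np

def sinkhorn(cost, row_marg, col_marg, eps=0.1, iters=200):
    """Entropic OT plan with the given row and column marginals."""
    K = np.exp(-cost / eps)
    u = np.ones_like(row_marg)
    v = np.ones_like(col_marg)
    for _ in range(iters):
        u = row_marg / (K @ v)
        v = col_marg / (K.T @ u)
    return u[:, None] * K * v[None, :]

num_experts = 3
row_marg = np.array([1.0])                         # a single input (batch = 1)
col_marg = np.full(num_experts, 1.0 / num_experts) # balanced experts (assumption)

# Two inputs with opposite expert preferences (illustrative costs).
cost_a = np.array([[0.1, 0.9, 0.5]])
cost_b = np.array([[0.9, 0.1, 0.5]])

plan_a = sinkhorn(cost_a, row_marg, col_marg)
plan_b = sinkhorn(cost_b, row_marg, col_marg)

def softmax_route(cost, eps=0.1):
    """Per-input softmax over negative costs, for comparison."""
    logits = -cost[0] / eps
    logits = logits - logits.max()
    w = np.exp(logits)
    return w / w.sum()

soft_a, soft_b = softmax_route(cost_a), softmax_route(cost_b)
```

Both plans come out equal to `col_marg`, so the balanced OT routing ignores the input entirely, while the per-input softmax still separates the two cases.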
-**Cross-reference:**
-- **BASE Layers** (ICML 2021): OT for MoE load balancing → the purpose is to **prevent expert collapse during training**, NOT inference routing. Entirely different.
-- **Selective Sinkhorn** (Nov 2025): OT routing for MoE, also for training, not for frozen-expert CL inference
-- **You rejected OT yourself** in C2_analysis_and_revision.md: "C2 (Grassmann-OT Routing) is rejected. OT was chosen because it was 'novel' (nobody had used it), NOT because it actually solves the problem better."
-### AI overstatement:
-The AI said: *"Extremely groundbreaking (Highly Novel)"*, *"A genius idea"*, *"No paper has used OT for routing in CL"*.
-**Reality**: "None exists" is TRUE, but the reason is that **OT is unsuitable** for per-input CL routing, NOT that nobody "thought of it". BASE Layers (2021) already used OT for MoE → the MoE/routing community knows OT. They do not use it for CL inference because the constraints do not fit.
-### Verdict: **THE IDEA IS WRONG FOR THIS APPLICATION, AND YOU REALIZED IT YOURSELF**
-The OT self-critique is the best part of the entire discussion. The trajectory: propose (excited) → think deeply → discover fatal flaws → reject → replace with a simpler, better alternative (softmax projection). This is research maturity.
----
-## II.7 - I7: "The loss becomes a distribution-based minimization"
-### Original argument:
-> "The optimization problem becomes minimizing a distribution-based loss for each task"
-The formulation the AI suggested:
-$$\mathcal{L}_{total} = \mathcal{L}_{task} + \alpha \cdot \mathcal{L}_{OT\_entropy} - \beta \cdot D_{geometric}(P_{new}, P_{old})$$
-### Verification:
-**(a) The distribution-aware routing loss: REASONABLE, BUT IT ALREADY EXISTS:**
-The idea that routing weights should emerge from distribution matching (rather than learned gating) is right in spirit. However:
-- **Feature Distributions** (ICML 2025): Already does exactly this: stores a "presentative feature distribution" per PEFT block, with routing = similarity to the stored distribution
-- **PromptCCD** (ECCV 2024): GMM for routing
-- **FeCAM** (NeurIPS 2023): Mahalanobis distance = implicit distributional matching
-**(b) The $D_{geometric}(P_{new}, P_{old})$ term: anti-drift/invasion:**
-This is inherited from simple_idea.txt: a penalty for center drift + invasion of old classes. In a modular architecture:
-- **LDC** (ECCV 2024): Learnable drift compensation → shows that drift is real and compensation helps
-- **Dual Drift** (ICCV 2025): Prototype drift at 2 levels
-**Problem**: An anti-drift loss for a modular architecture NEEDS a forward pass on old data to compute drift → **violates zero-replay**. Unless a proxy is used (e.g., prototype centers stored at the end of each task), but those are data statistics again.
-### Verdict: **MIXED. The distribution-aware spirit is right, but the concrete formulation is not yet clean**
----
-# III. THE BIG PICTURE: WHAT SURVIVES AND WHAT DOES NOT
-## III.1 Summary of verdicts
-| Idea | Verdict | Reason |
-|---------|---------|-------|
-| I1: CL = optimization on a manifold | ✅ 85% correct | Conceptually correct; terminology needs tightening (affine subspace, not manifold) |
-| I2: Penalty instead of hard orthogonality | ⚠️ Right direction, mixed evidence | O-LoRA (penalty) is worse than InfLoRA (hard). MINGLE (hybrid) is competitive. |
-| I3: Soft > hard gate | ✅ 100% correct, but consensus | Not novel; standard practice in 2024-2025 |
-| I4: LoRA = hyper-ellipsoid, SVD signature | ✅ 70% correct | Connection correct; "parameter space" is imprecise → "input sensitivity space". Tool = textbook, application = new |
-| I5: SVM soft margin between ellipsoids | ⚠️ Right spirit, wrong tool | SVM is ill-posed for T=15 objects. Grassmann distance is better |
-| I6: OT routing | ❌ Wrong for the CL setting | Per-input vs. batch; balance constraint is harmful. You rejected it yourself, correctly |
-| I7: Distribution-based loss | ⚠️ Right direction, not yet clean | Anti-drift needs old data → tension with zero-replay |
-## III.2 The SOLID parts (a methodology can be built on these):
-1. **Expert characterization via SVD** (I4 refined): Frozen LoRA → SVD → spectral signature. Clean, zero-replay compliant, mathematically grounded.
-2. **Geometric rather than algebraic separation** (I5 refined): Grassmann distance and principal angles instead of an SVM. The spirit of "geometry-aware separation" is right; the tool needs replacing.
-3. **Manifold perspective** (I1): CL = constrained optimization, and subspace exhaustion is real → capacity must be managed.
-4. **Soft integration** (I3): Standard but correct: competitive softmax routing.
-## III.3 Parts to DROP or convert:
-1. **OT routing** (I6): Already rejected; do not go back. Softmax projection routing is simple, correct, and working.
-2. **SVM formulation** (I5): Replace with a pairwise Grassmann distance penalty.
-3. **Anti-drift loss** (this part of I7): In tension with zero-replay. To keep it, state clearly that NO old data is used, only stored parameters (weight-derived proxies).
----
-# IV. OVERALL CRITIQUE: "THE ELEPHANT IN THE ROOM"
-I need to challenge 3 major assumptions that neither you nor the AI addressed sufficiently:
-## IV.1 "Modification energy ≠ Modification quality"
-Projection fit measures "HOW MUCH the expert will MODIFY THE INPUT along direction $v_i$".
-$$\text{fit}_t(h) = \frac{\sum_i \sigma_{t,i}^2 (v_{t,i}^T h)^2}{\sum_i \sigma_{t,i}^2 \|h\|^2}$$
-But modifying strongly does **NOT MEAN** modifying correctly. For example:
-- An expert can modify the input strongly along $v_1$ while the modification makes the OUTPUT WORSE (wrong direction in output space)
-- Two experts can have the same input sensitivity but different OUTPUT behavior
-**Counter-argument** (weak): The expert was trained on task $t$ → the learned modification is presumably correct for task $t$ inputs → high projection fit + correct task overlap → the modification is likely correct.
-**Verdict**: The assumption needs empirical validation. If routing accuracy > 90% → the assumption holds; otherwise → output-sensitive routing is needed.
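The projection fit defined in this subsection, followed by a competitive softmax, can be sketched as below. This is a minimal illustration and not the code in `t5_specroute.py`: the function names, the temperature `tau=0.1`, and the toy signatures are all assumptions made for the example.

```python
import numpy as np

def projection_fit(h, V, sigma):
    """fit_t(h) = sum_i sigma_i^2 (v_i^T h)^2 / (sum_i sigma_i^2 * ||h||^2).

    V: (d, r) right singular vectors as columns; sigma: (r,) singular values.
    """
    weights = sigma ** 2
    proj = V.T @ h
    return float(np.sum(weights * proj ** 2) / (np.sum(weights) * (h @ h)))

def route(h, signatures, tau=0.1):
    """Softmax over per-expert fits; tau is a hypothetical temperature."""
    fits = np.array([projection_fit(h, V, s) for V, s in signatures])
    logits = fits / tau
    logits = logits - logits.max()      # numerical stability
    w = np.exp(logits)
    return fits, w / w.sum()

# Toy signatures: two experts on disjoint coordinate directions of R^8.
eye = np.eye(8)
sigma = np.array([2.0, 1.0])
signatures = [(eye[:, 0:2], sigma), (eye[:, 2:4], sigma)]

h = eye[:, 0]                           # input aligned with expert 0's top direction
fits, weights = route(h, signatures)
```

Note that the fit depends only on the right singular vectors and singular values, so two experts with identical $(V, \Sigma)$ but different $U$ receive identical routing weights, which is exactly the "energy, not quality" limitation discussed above.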
-## IV.2 "Mean pooling loses sequence structure"
-Both GainLoRA and SpecRoute route based on:
-$$\bar{h} = \frac{1}{|\text{tokens}|} \sum_i h_i$$
-Two sequences from different tasks but with similar averages → misrouted. For example:
-- "Summarize this article about climate change" vs. "Answer this question about climate change"
-- The average embeddings are close (same content), but the tasks differ (summarization vs. QA)
-**Mitigating factor**: Routing is based on ALL encoder layers (averaged), not only the embedding layer → higher layers encode task-type information → confusion is less likely.
-**Verdict**: A partial weakness, addressable but not currently addressed.
-## IV.3 "Representation drift is real, but nobody has quantified it"
-As LoRA branches are added one after another, the input embeddings $h^{(l)}$ at each layer change (because of accumulated LoRA effects). The spectral signatures are frozen → the fit is computed on a drifted $h$ → routing quality degrades.
-GainLoRA's answer: `previous_trans_input` snapshots (frozen MLPs per task). SpecRoute: has NO mechanism for drift.
-**Hypothesis**: Drift is small because the LoRA rank is low ($r = 4$), so the total modification rank is ≤ 60 in 1024 dims.
-**NOBODY HAS MEASURED IT**.
----
-# V. PROPOSED METHODOLOGY: BUILDING FROM THE SOLID PARTS
-## V.1 Core thesis (from your original idea, refined)
-> **In expandable LoRA CL, frozen expert weights encode enough geometric information (via their SVD spectral structure) that routing needs NO learned parameters. Parameter-free routing removes routing forgetting, simplifies training, and reduces subspace consumption.**
-This is the genuinely valuable insight from your thinking process: from I4 (geometric characterization), distilled into "spectral signatures are sufficient for routing".
-## V.2 Framework: 3 tiers (instead of 3 separate "contributions")
-### Tier 1: Expert Geometry (I4 refined)
-**What**: Each frozen expert $\Delta W_t = B_t A_t$ is characterized by a spectral signature $\mathcal{S}_t = \{V_t, \Sigma_t\}$ from its SVD.
-**Geometric interpretation**: Expert $t$ "listens to" the set of input directions $\{v_{t,i}\}$ with sensitivity $\sigma_{t,i}^2$. Together the sensitivity levels form an ellipsoidal pattern on the input space (correct per I4, refined to the right space).
-**Why it is grounded**:
-- The SVD is a unique factorization (up to sign, for distinct singular values) → deterministic
-- $V_t$ encodes EXACTLY "which input directions the expert operates on" (from InfLoRA's Proposition 1)
-- Zero-replay compliant: computed from model params, not data
-- Immutable: computed from frozen weights
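A minimal sketch of how such a signature could be extracted, assuming $B_t$ has shape `(d_out, r)` and $A_t$ has shape `(r, d_in)`. The function name `spectral_signature` is hypothetical. It works on the small factors (two QR decompositions plus an `r x r` SVD) so the full `d_out x d_in` matrix never has to be decomposed.

```python
import numpy as np

def spectral_signature(B, A):
    """Return (V, sigma) for delta_W = B @ A.

    V has shape (d_in, r): its columns are the right singular vectors,
    i.e. the input directions the expert responds to.
    """
    Qb, Rb = np.linalg.qr(B)        # B   = Qb @ Rb, Qb: (d_out, r)
    Qa, Ra = np.linalg.qr(A.T)      # A.T = Qa @ Ra, Qa: (d_in, r)
    # delta_W = Qb @ (Rb @ Ra.T) @ Qa.T, so only the r x r core needs an SVD.
    _, sigma, Vt_small = np.linalg.svd(Rb @ Ra.T)
    V = Qa @ Vt_small.T
    return V, sigma

rng = np.random.default_rng(0)
B = rng.standard_normal((32, 4))
A = rng.standard_normal((4, 64))
V, sigma = spectral_signature(B, A)
print(V.shape, sigma.shape)
```

Because the inputs are frozen weights, the signature can be computed once per expert and cached.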
-### Tier 2: Geometric Routing (I5 in spirit + I6 rejected → softmax)
-**What**: Route input $h$ to experts via weighted projection fit (Section IV.3 of SpecRoute). Competitive softmax routing.
-**Why softmax not OT**: (I6 rejected, correctly) per-input, no balance needed, works at batch=1.
-**Why softmax not sigmoid**: Competitive → forces selection → the right inductive bias for non-overlapping tasks. Scale-stable ($\sum w = 1$).
-**Why projection fit not learned gating**: (Your core insight) parameter-free, immutable, directly functional.
-**Geometric separation**: Instead of an SVM (I5 rejected), separation emerges NATURALLY from:
-- GPM ensures $\text{span}(V_t) \perp \text{span}(V_{t'})$ approximately
-- $\Rightarrow$ $\text{fit}_t(h)$ high → $\text{fit}_{t'}(h)$ low for $t' \neq t$
-- No extra penalty is needed: orthogonality already ensures discriminative routing
-**This is the deep insight**: You wanted max separation (I5), but GPM ALREADY provides it. The two mechanisms complement each other:
-- GPM ensures orthogonal experts (structural separation)
-- Spectral routing exploits that orthogonality (functional separation)
-- No additional penalty/SVM/OT is needed
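The routing rule can be sketched as follows. The exact fit formula lives in Section IV.3 of SpecRoute, which is not reproduced here, so the $\sigma^2$-weighted, norm-normalized form below is an assumption, as are the names `projection_fit` and `route` and the temperature value.

```python
import numpy as np

def projection_fit(h, V, sigma):
    """Assumed fit: sigma^2-weighted energy of h along the expert's input
    directions, normalised by ||h||^2 so the value lies in [0, 1]."""
    coeff = V.T @ h
    w = sigma**2 / np.sum(sigma**2)
    return float(np.sum(w * coeff**2) / (h @ h + 1e-12))

def route(h, signatures, temperature=0.1):
    """Competitive softmax over per-expert fits; weights sum to 1."""
    fits = np.array([projection_fit(h, V, s) for V, s in signatures])
    logits = fits / temperature
    logits = logits - logits.max()          # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Two toy experts with exactly orthogonal input subspaces.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))
signatures = [(Q[:, :4], np.ones(4)), (Q[:, 4:8], np.ones(4))]
h = Q[:, :4] @ np.ones(4)                   # input lying in expert 0's subspace
weights = route(h, signatures)
print(weights)
```

With orthogonal subspaces the fit of the non-matching expert is zero, so the softmax concentrates on the matching one without any learned gate.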
-### Tier 3: Capacity management (I1 + I2 refined)
-**What**: Manage the subspace budget so that future tasks still have enough capacity.
-**From I1**: Subspace exhaustion is real: K constraints accumulate and the feasible manifold shrinks.
-**From I2**: A pure penalty (loosening orthogonality) is counterproductive. A pure hard lock (GPM with an increasing threshold) is unfair.
-**Principled approach** (not yet implemented, but well-defined):
-- Importance-weighted protection: directions with large $\sigma_i^2$ → protect strongly; small $\sigma_i^2$ → protect weakly or release
-- Constant threshold ($\epsilon = 0.995$) → fair allocation (each task protects the same ratio)
-- Capacity monitoring: track $\dim(\mathcal{M}_{1:t})$ vs $d_{in}$ → alert when approaching exhaustion
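The last two bullets could be prototyped roughly as below. This sketch assumes the threshold $\epsilon$ is applied to cumulative $\sigma^2$ energy (singular values sorted in descending order) and that each task contributes an orthonormal basis of protected directions. The function names and the `warn_ratio` parameter are hypothetical.

```python
import numpy as np

def num_protected_directions(sigma, eps=0.995):
    """Smallest k whose top-k singular directions hold >= eps of the sigma^2 energy."""
    energy = np.cumsum(sigma**2) / np.sum(sigma**2)
    k = int(np.searchsorted(energy, eps)) + 1
    return min(k, len(sigma))

def capacity_report(bases, d_in, warn_ratio=0.8):
    """Track how much of the d_in-dimensional input space is already protected."""
    M = np.concatenate(bases, axis=1)           # (d_in, total protected columns)
    used = int(np.linalg.matrix_rank(M))
    ratio = used / d_in
    return {"used": used, "free": d_in - used, "ratio": ratio,
            "alert": ratio >= warn_ratio}

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))
early = capacity_report([Q[:, :4], Q[:, 4:8]], d_in=16)
late = capacity_report([Q[:, :4], Q[:, 4:8], Q[:, 8:14]], d_in=16)
print(early, late)
```

Using the rank of the concatenated bases, rather than a column count, avoids double-counting directions that two tasks share.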
-## V.3 Why this framework generalizes to the WHOLE CLASS OF PROBLEMS
-The framework does not depend on:
-1. **Backbone**: T5, LLaMA, BERT (as long as there are linear attention layers where LoRA is applied)
-2. **Task type**: Generation, classification, QA (as long as expandable LoRA is used)
-3. **Anti-forgetting method**: Compatible with GPM, InfLoRA, O-LoRA, CLoRA (as long as experts have null-space structure)
-4. **Number of tasks**: SVD + softmax scale linearly with T
-It also provides a unified view of existing methods:
-| Method | Expert Geometry | Routing | Anti-forgetting |
-|--------|----------------|---------|-----------------|
-| GainLoRA | Implicit (in the learned gate) | Learned (MLP + cosine) | GPM on all params |
-| InfLoRA | None (equal weight) | None (uniform) | Null-space init |
-| MINGLE | SVD for construction | Learned (MoE gate) | Null-space + EMA relax |
-| Feature Dist. | Mean feature vectors | Similarity matching | None explicit |
-| **This framework** | SVD spectral signature | Projection fit + softmax | GPM on LoRA only |
----
-# VI. WHAT THIS FRAMEWORK CANNOT DO (honest)
-1. **Guarantee correct routing**: Projection fit is a proxy, not an oracle. If expert's input subspace doesn't uniquely identify task → routing errors.
-2. **Handle representation drift**: No explicit mechanism. Relies on hypothesis that low-rank LoRA → small drift. Unproven.
-3. **Solve subspace exhaustion completely**: Constant threshold is incremental improvement, not solution. True solution requires importance-weighted dynamic allocation (not implemented).
-4. **Claim novelty on ALL components**: Soft gate, SVD, GPM are all existing tools. Novelty is THE COMBINATION: "weight-derived spectral routing in CL" and "parameter-free routing eliminates routing forgetting".
-5. **Replace empirical validation**: Every claim above is theoretical. NOTHING is proven until experiments run.
----
-# VII. RESOLUTION: THE EXACT TRAJECTORY OF YOUR THINKING
-Looking back over the whole discussion, the trajectory of your thinking was:
-```
-Observation: CL = optimization on a constrained manifold (I1)
-  ↓
-Insight: Hard constraints cause exhaustion (I2)
-  ↓
-Pivot: Soft gate for flexibility (I3)
-  ↓
-Key idea: LoRA geometry = ellipsoid, SVD captures it (I4) ← MOST CORRECT
-  ↓
-Over-engineering: SVM for max margin (I5) ← RIGHT SPIRIT, WRONG TOOL
-  ↓
-Over-engineering: OT for routing (I6) ← WRONG FOR THE CL SETTING
-  ↓
-Abstraction: Distribution-based loss (I7) ← RIGHT DIRECTION, DETAILS NOT YET THERE
-  ↓
-Self-correction: Reject OT → Projection fit + softmax (C2_analysis) ← EXCELLENT
-  ↓
-Final: SpecRoute = SVD signatures + projection routing + constant threshold
-```
-**Pattern**: Start from correct insights (I1, I4) → over-engineer (I5, I6) → get inflated by the AI instead of corrected → realize it yourself → simplify. The final product (SpecRoute) is simpler than the starting point, and **that is a good sign**.
-**Concern**: While simplifying, you also dropped some good ideas:
-- I2 (capacity awareness) → the current ESA is too simple (constant threshold)
-- I5 in spirit (geometry-aware separation) → no explicit mechanism remains; it relies entirely on GPM's approximate orthogonality
-**Recommendation**:
-- Treat ESA as an **open problem**, not a solved contribution
-- Grassmann distance monitoring (without a penalty loss) can serve as a **diagnostic tool** for the paper: track separation quality across tasks
----
-# VIII. FINAL RECOMMENDATIONS
-## If the goal is a paper:
-1. **Core contribution to claim**: "Parameter-free routing via spectral signatures of frozen LoRA weights eliminates routing forgetting." This is genuine novelty: verifiable and clean.
-2. **MUST-HAVE experiments**:
-   - SpecRoute vs GainLoRA (same benchmark, same data splits)
-   - Routing accuracy analysis (on held-out old tasks)
-   - Representation drift measurement
-   - Ablation: spectral fit vs prompt key vs random vs uniform
-3. **Do not claim ESA (C3)**: A constant threshold is not strong enough. Either develop an importance-weighted version or fold it into the hyperparameter section.
-4. **Position vs Feature Distributions (ICML 2025)**: Closest competitor. Their key = store mean feature vectors (data-level). Your key = store SVD of frozen weights (weight-level). Both are "characterization + similarity routing", but you are zero-replay clean, they arguably are not.
-## If the goal is a methodology for the whole class of problems:
-1. **Formalize "Expert Characterization Problem"**: Given frozen expert weights, what is the optimal characterization for downstream routing? SVD is one answer, but the framework should define CRITERIA (immutable, functional, discriminative, compact) and then show that SVD satisfies all of them.
-2. **Formalize "Routing Correctness"**: Define routing accuracy operationally, prove that projection fit + orthogonal experts → routing accuracy ≥ threshold.
-3. **Formalize "Capacity Budget"**: Given $d_{in}$ dims, $T$ tasks, what is the maximum information each task can claim while maintaining minimum routing quality? This is the real open problem.
-4. **RUN THE EXPERIMENTS before writing more.** You have thought enough. The code exists. Empirical results will show whether the framework has value: if it does not win on the numbers, no amount of elegant theory will be enough.

human_working_IdeaMethod_and_discuss/method.md DELETED Viewed

@@ -1,458 +0,0 @@
-# RIEMANNIAN TOPOLOGICAL ALIGNMENT (RTA) FOR CONTINUAL LEARNING
-## I. MOTIVATION & THEORETICAL FOUNDATION
-### Problem Statement
-In Continual Learning (CL), the encoder drifts when learning new tasks, which leads to catastrophic forgetting. Existing methods (e.g., MINION v17) preserve knowledge only at the output level and do not model the geometry of the feature distribution.
-### Core Insight
-After normalization, features lie on the hypersphere $\mathbb{S}^{d-1}$, not in Euclidean space. Therefore:
-- Distances/angles between features must be measured with the Riemannian metric, not Euclidean distance
-- Distribution structure (covariance) on a curved manifold differs fundamentally from the Euclidean case
-- Preserving topology = preserving the Fisher Information Metric (FIM), not just preserving weights
-### Transition from MINION v17 → RTA
-**MINION v17 limitations:**
-- Isotropic vMF model: assumes every dimension has the same spread
-- Linear Procrustes alignment: error accumulates across layers
-- Does not detect feature drift; it only aligns parameters
-- No formal definition of "knowledge preservation"
-**RTA improvements:**
-- Bingham distribution (anisotropic): can learn ellipsoidal clusters
-- Parallel transport on the manifold: preserves metric relationships
-- Feature-level monitoring + Riemannian distillation
-- Formalizes preservation via the Fisher Information Metric
----
-## II. FRAMEWORK COMPONENTS
-### Stage 1: Anisotropic Probability Modeling
-#### From vMF (isotropic) to Bingham (anisotropic)
-The standard von Mises-Fisher model captures only rotational symmetry around the mean direction:
-$$f(z; \mu, \kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2} I_{d/2-1}(\kappa)} \exp(\kappa \mu^T z)$$
-But this assumes **every direction away from the center $\mu$ is equally likely**, which is a poor fit because:
-- Feature dimensions carry different meanings
-- Task-specific dimensions have higher variance
-- Catastrophic forgetting occurs when task-specific dimensions are overwritten
-#### Bingham Distribution: the Anisotropic Solution
-On the hypersphere $\mathbb{S}^{d-1}$, we use the **Bingham distribution**:
-$$f(z; A_c) = \frac{1}{F(A_c)} \exp(z^T A_c z), \quad z \in \mathbb{S}^{d-1}$$
-**Advantages:**
-- $A_c = \sum_{i=1}^{d} \lambda_i v_i v_i^T$ is a symmetric matrix
-- Eigenvectors $\{v_i\}$: the principal axes of the feature cluster
-- Eigenvalues $\{\lambda_i\}$: the "length" of the cluster along each axis (anisotropy)
-- Automatically learns ellipsoidal clusters rather than circular ones
-**Per-class modeling:**
-$$P_c^{(t)} = \{A_c^{(t)}, \text{variance}_c^{(t)}\}$$
-Store the **full covariance structure**, not just mean + concentration as in vMF.
-### Stage 2: Locking Topology via Riemannian Knowledge Distillation
-#### Problem: Catastrophic Forgetting from Topology Shift
-When the encoder is updated on task $t$, the mean and covariance of the old classes change:
-- **Mean shift**: $\mu_c^{(t-1)} \to \tilde{\mu}_c^{(t)}$
-- **Axis rotation**: $V_{c}^{(t-1)} \to V_{c}^{(t)}$
-- **Anisotropy change**: $\Lambda_c^{(t-1)} \to \Lambda_c^{(t)}$
-→ **The topology is deformed**, even if the output predictions still look reasonable
-#### Solution: Riemannian Kullback-Leibler Divergence
-Instead of relying on output-level distillation alone:
-$$\mathcal{L}_{old} = \text{KL}(p_{old}(y|x) \| p_{new}(y|x))$$
-we add a **Riemannian KL on the parameter manifold**:
-$$\mathcal{L}_{geo} = D_{RKL}(P_{old}^{(t-1)} \| P_{new}^{(t)})$$
-**Formal definition:**
-$$D_{RKL}(P_1 \| P_2) = \int_{\Theta} P_1(\theta) \log \frac{P_1(\theta)}{P_2(\theta)} d\theta$$
-where $\Theta$ is equipped with the **Fisher Information Metric (FIM)**:
-$$g_{ij}(\theta) = \mathbb{E}_{x,y \sim P(\cdot|\theta)} \left[ \frac{\partial \log p(y|x;\theta)}{\partial \theta_i} \frac{\partial \log p(y|x;\theta)}{\partial \theta_j} \right]$$
-#### Meaning: Information Preservation
-- KL divergence through the FIM = "how far the parameters can move while still preserving the classification boundary"
-- Geometry lock: if $D_{RKL} \approx 0 \Rightarrow$ the structure of $P_{old}$ is intact
-- Automatic trade-off between new-task performance and old-task retention (no need to tune multiple λ's)
-#### Implementation Detail
-Per-layer:
-$$\mathcal{L}_{geo} = \sum_{l=1}^{L} D_{RKL}^{(l)}(A_c^{(t-1)} \| A_c^{(t)})$$
-Approximated by the **Bures-Wasserstein distance** on covariances:
-$$W_2(A_c^{old}, A_c^{new}) = \left[\text{Tr}\left(A_c^{old} + A_c^{new} - 2\left((A_c^{old})^{1/2} A_c^{new} (A_c^{old})^{1/2}\right)^{1/2}\right)\right]^{1/2}$$
-### Stage 3: Drift Correction via Parallel Transport on Manifold
-#### Limitation of Procrustes Rotation (MINION v17)
-Procrustes finds the optimal rotation matrix $R^*$ that aligns $W_0$ to $W_1$:
-$$R^* = \arg\min_R \|R W_0 - W_1\|_F$$
-**Problems:**
-1. Assumes a **Euclidean metric**, but the features lie on the hypersphere
-2. **Accumulating error**: when applied across $L$ layers, the error accumulates with depth
-3. Does not preserve **inner products** on the manifold
-4. Does not capture **non-linear drift** (e.g., rotation + dilation together)
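For reference, the Procrustes objective above has a closed-form solution through an SVD. The sketch below assumes $W_0$ and $W_1$ have shape `(d, n)` and solves over all orthogonal matrices, so the result may include a reflection; restricting to pure rotations would need an extra sign correction that is not shown.

```python
import numpy as np

def orthogonal_procrustes(W0, W1):
    """Orthogonal R minimising ||R @ W0 - W1||_F, via the SVD of W1 @ W0.T."""
    U, _, Vt = np.linalg.svd(W1 @ W0.T)
    return U @ Vt

rng = np.random.default_rng(0)
d, n = 6, 20
W0 = rng.standard_normal((d, n))
R_true, _ = np.linalg.qr(rng.standard_normal((d, d)))   # a random orthogonal map
W1 = R_true @ W0
R = orthogonal_procrustes(W0, W1)
print(np.linalg.norm(R @ W0 - W1))
```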
-#### Riemannian Alternative: Parallel Transport
-**Intuition**: On a curved manifold, when moving from point A → B, how do we "move" a vector while keeping its "orientation"?
-**Answer**: Parallel transport: move the vector along the **geodesic** from A to B.
-#### Mathematical Framework
-Let the feature distribution drift from $\mu_c^{old}$ → $\mu_c^{new}$ on $\mathbb{S}^{d-1}$:
-**Step 1: Determine the Geodesic**
-The shortest curve on the sphere joining the points $\mu_c^{old}$ and $\mu_c^{new}$:
-$$\gamma(t) = \frac{\sin((1-t)\theta)}{\sin\theta} \mu_c^{old} + \frac{\sin(t\theta)}{\sin\theta} \mu_c^{new}, \quad t \in [0,1]$$
-where $\theta = \arccos(\mu_c^{old} \cdot \mu_c^{new})$ is the geodesic distance.
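Step 1 can be sketched with the standard spherical interpolation (slerp) form, which divides by $\sin\theta$ so the curve stays on the unit sphere. The function name `sphere_geodesic` is hypothetical, and the antipodal case ($\theta = \pi$), where the geodesic is not unique, is deliberately left unhandled.

```python
import numpy as np

def sphere_geodesic(p, q, t):
    """Point at fraction t along the great-circle arc from unit vector p to q."""
    theta = np.arccos(np.clip(p @ q, -1.0, 1.0))   # geodesic distance
    if theta < 1e-8:                               # p and q coincide numerically
        return p.copy()
    return (np.sin((1 - t) * theta) * p + np.sin(t * theta) * q) / np.sin(theta)

rng = np.random.default_rng(0)
p = rng.standard_normal(5); p /= np.linalg.norm(p)
q = rng.standard_normal(5); q /= np.linalg.norm(q)
mid = sphere_geodesic(p, q, 0.5)
print(np.linalg.norm(mid))
```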
-**Step 2: Transport the Covariance**
-The covariance matrix $A_c^{old}$ must be moved along the geodesic to become $A_c^{aligned}$:
-$$A_c^{aligned} = \text{ParallelTransport}_{\gamma}(A_c^{old})$$
-**Step 3: Compute the Parallel Transport**
-On the sphere, parallel transport of a tangent vector $v$ along a geodesic is defined by the **Levi-Civita connection**:
-$$\frac{D v}{dt} = 0 \quad \text{along } \gamma(t)$$
-**Explicit formula for the Bingham covariance:**
-$$A_c^{aligned} = A_c^{old} - (\theta \cot(\theta) - 1)(A_c^{old} \cdot \mu_c^{old})(\mu_c^{old})^T$$
-#### Advantages over Procrustes
-1. **Metric preserving**: $\langle v, w \rangle_{aligned} = \langle v, w \rangle_{old}$ (inner products preserved)
-2. **Path-independent**: The result does not depend on how the drift occurred
-3. **Error bounded**: Error does not accumulate across layers (orthogonality guaranteed)
-4. **Theoretically sound**: Based on Riemannian geometry, not ad hoc
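The metric-preservation property can be checked numerically. The sketch below does not use the explicit covariance formula quoted above; it swaps in the standard closed form for transporting a tangent vector along the minimizing geodesic of the unit sphere, $v \mapsto v - \frac{\langle q, v\rangle}{1 + \langle p, q\rangle}(p + q)$, and applies the same linear map on both sides of a matrix. Function names are hypothetical.

```python
import numpy as np

def transport_map(p, q):
    """Linear map carrying tangent vectors at p to tangent vectors at q (unit sphere)."""
    c = float(p @ q)
    if np.isclose(c, -1.0):
        raise ValueError("antipodal points: the minimizing geodesic is not unique")
    return np.eye(len(p)) - np.outer(p + q, q) / (1.0 + c)

def transport_vector(v, p, q):
    return transport_map(p, q) @ v

def transport_matrix(A, p, q):
    """Transport a symmetric matrix supported on the tangent space at p."""
    P = transport_map(p, q)
    return P @ A @ P.T

rng = np.random.default_rng(0)
p = rng.standard_normal(5); p /= np.linalg.norm(p)
q = rng.standard_normal(5); q /= np.linalg.norm(q)
x, y = rng.standard_normal(5), rng.standard_normal(5)
v, w = x - (x @ p) * p, y - (y @ p) * p        # project onto the tangent space at p
tv, tw = transport_vector(v, p, q), transport_vector(w, p, q)
A = np.outer(v, v) + np.outer(w, w)
print(tv @ q, tv @ tw - v @ w)
```

Both printed values should be zero up to rounding: the transported vector is tangent at `q`, and inner products are unchanged.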
-#### Implementation Consideration
-In practice, only $M=1$ exemplar from each old class is needed to estimate $\mu_c^{new}$:
-- Compute $\mu_c^{obs} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} z_i^{(new)}$ on the test set of class $c$
-- Update the geodesic angle: $\theta = \arccos(\mu_c^{old} \cdot \mu_c^{obs})$
-- Apply parallel transport to all $A_c$ parameters
-### Stage 4: Unified Learning Objective
-#### Full Loss Function
-Combines discrimination and retention. The mutual-information term is maximized, so it enters the loss with a minus sign:
-$$\mathcal{L}_{total} = \underbrace{\mathcal{L}_{CE}(f(x), y)}_{\text{new task}} - \lambda_1 \underbrace{\mathcal{I}(z; y)}_{\text{discriminativity}} + \lambda_2 \underbrace{D_{RKL}(P_{old} \| P_{new})}_{\text{geometry lock}}$$
-**Details of each term:**
-**Term 1: Task-specific Cross-Entropy**
-$$\mathcal{L}_{CE} = -\log p(y|x; \theta_t)$$
-Standard supervised loss on the new task $t$.
-**Term 2: Mutual Information (Discriminativity)**
-$$\mathcal{I}(z; y) = H(y) - H(y|z) = \mathbb{E}_{z,y}[\log p(y|z)] - \mathbb{E}_y[\log p(y)]$$
-Estimated via **InfoNCE** (contrastive learning):
-$$\mathcal{I} \approx \mathbb{E}_{(x,y)} \left[ \log \frac{\exp(z^T z_{pos}/\tau)}{\sum_{k} \exp(z^T z_k/\tau)} \right]$$
-Purpose: ensure the features still carry high-fidelity discriminative information for class separation.
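A minimal NumPy sketch of the InfoNCE estimate, assuming each anchor `z[i]` has one positive `z_pos[i]` and that the other positives in the batch act as negatives. How positives are chosen (same class, augmentation, and so on) is not specified in the text and is left open here.

```python
import numpy as np

def info_nce(z, z_pos, tau=0.05):
    """Mean log-probability of picking the true positive among in-batch candidates.
    The value is <= 0; its negative is the usual InfoNCE loss."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    z_pos = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)
    logits = z @ z_pos.T / tau                          # (n, n) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True) # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 16))
matched = info_nce(z, z)                      # positives identical to anchors
shuffled = info_nce(z, np.roll(z, 1, axis=0)) # positives deliberately mismatched
print(matched, shuffled)
```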
-**Term 3: Riemannian KL Distillation**
-$$D_{RKL}(P_{old} \| P_{new}) = \sum_{c \in \text{old}} W_2(A_c^{old}, A_c^{new})$$
-+ Apply the parallel transport correction from Stage 3
-+ Minimize the covariance shift across all layers
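The $W_2$ term can be computed with two symmetric eigendecompositions. The sketch assumes both inputs are symmetric positive semidefinite, as covariance matrices are; a raw Bingham parameter matrix need not be PSD, so it would have to be mapped to a covariance first, which is not shown.

```python
import numpy as np

def psd_sqrt(M):
    """Principal square root of a symmetric PSD matrix via eigendecomposition."""
    w, Q = np.linalg.eigh((M + M.T) / 2)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def bures_wasserstein(A, B):
    """Bures-Wasserstein distance between PSD matrices A and B."""
    root_A = psd_sqrt(A)
    cross = psd_sqrt(root_A @ B @ root_A)
    d2 = np.trace(A) + np.trace(B) - 2.0 * np.trace(cross)
    return float(np.sqrt(max(d2, 0.0)))     # clamp tiny negative rounding errors

A = np.diag([4.0, 1.0])
B = np.diag([1.0, 1.0])
print(bures_wasserstein(A, B))
```

For commuting matrices the distance reduces to the Euclidean distance between the square roots of the eigenvalues, which makes the diagonal example easy to check by hand.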
-#### Dynamic Weight Scheduling
-Instead of fixed $\lambda_1, \lambda_2$, use **adaptive weighting**:
-$$\lambda_1(t) = \lambda_1^{init} \times (1 - \frac{t}{T})^p, \quad p \in [1,2]$$
-$$\lambda_2(t) = \lambda_2^{init} \times (1 + \frac{t}{T})^q, \quad q \in [1,2]$$
-- Early epochs: emphasize task learning ($\lambda_1 \uparrow$, $\lambda_2 \downarrow$)
-- Later epochs: emphasize retention ($\lambda_1 \downarrow$, $\lambda_2 \uparrow$)
-- $t = $ number of gradient updates
-- $T = $ total updates in task
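The two schedules translate directly into a small helper. The default values are taken from the recommended hyperparameters and the exponent 1.5 used later in the training pseudocode; the function name `loss_weights` is hypothetical.

```python
def loss_weights(step, total_steps, lam1_init=0.1, lam2_init=0.01, p=1.5, q=1.5):
    """Return (lambda_1, lambda_2) at a given gradient step within one task."""
    frac = min(max(step / total_steps, 0.0), 1.0)   # t / T, clipped to [0, 1]
    lam1 = lam1_init * (1.0 - frac) ** p            # decays to 0 by the end
    lam2 = lam2_init * (1.0 + frac) ** q            # grows to 2**q times its start
    return lam1, lam2

for step in (0, 50, 100):
    print(step, loss_weights(step, 100))
```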
-#### Per-Layer Adaptation
-Because early layers drift little (general features) compared with late layers (task-specific):
-$$\lambda_2^{(l)} = \lambda_2 \times (1 + \alpha \cdot l / L)^{\beta}$$
-with $\alpha, \beta > 0$ chosen via validation.
----
-## III. COMPARATIVE ANALYSIS: RTA vs. MINION v17
-| Criterion | MINION v17 | RTA | Advantage |
-|-----------|-----------|-----|-----------|
-| **Distribution Model** | von Mises-Fisher (isotropic) | Bingham (anisotropic) | RTA captures task-specific anisotropy |
-| **Parameter Geometry** | Euclidean assumptions | Riemannian manifold | RTA preserves topology on $\mathbb{S}^{d-1}$ |
-| **Drift Correction** | Procrustes (linear rotation) | Parallel transport (geodesic path) | RTA avoids error accumulation |
-| **Knowledge Retention** | KL divergence on outputs | Riemannian KL + FIM-weighted | RTA locks feature topology, not just predictions |
-| **Adaptation** | Fixed ensemble weights | Dynamic per-layer scheduling | RTA adapts to feature drift rate |
-| **Drift Detection** | None (implicit in weight change) | Explicit geodesic distance | RTA quantifies drift magnitude |
-| **$M=1$ Reliability** | Low (mean estimate unstable) | Medium-High (only for geodesic direction) | RTA robust with single exemplar |
-| **Computational Cost** | O($d^2$) per layer | O($d^3$) for eigendecomposition | RTA slightly higher cost, justified by robustness |
-**Summary**: RTA offers theoretical guarantees on metric preservation, automatic feature-level monitoring, and principled drift correction. MINION v17 is faster but more ad hoc.
----
-## IV. THEORETICAL JUSTIFICATION
-### Why Bingham > von Mises-Fisher?
-Consider binary classification on the sphere. Features lie on a hemisphere of $\mathbb{S}^{d-1}$:
-- Features of class 0: clustered around $\mu_0$
-- Features of class 1: clustered around $\mu_1$
-**vMF assumption**: All eigenvectors of the covariance share the eigenvalue $\kappa$ (same concentration)
-→ Circular clusters, which is dangerous when:
-  - Task-specific directions overlap (confusable features)
-  - Early-stop causes under-learning in some dimensions
-**Bingham modeling**: The eigenvalues $\lambda_i$ differ
-→ Ellipsoidal clusters capture:
-  - Discriminative dimensions (high $\lambda_i$)
-  - Non-discriminative "noise" dimensions (low $\lambda_i$)
-  - Automatically learns importance weighting per dimension
-### Why Parallel Transport > Procrustes?
-**Procrustes on Hypersphere:**
-If we apply $\hat{z} = R z$ with $R \in SO(d)$ to a feature $z \in \mathbb{S}^{d-1}$:
-$$\|R z\|_2 = \|z\|_2 = 1 \checkmark$$
-But when this is **repeated across layers:**
-$$z^{(L)} = R_L \cdots R_2 R_1 z^{(0)}$$
-Due to numerical precision, $\|z^{(L)}\|_2 \approx 1 - \epsilon L$ (accumulates!)
-**Parallel Transport preservation:**
-For a vector $v \in T_p \mathbb{S}^{d-1}$ and its parallel transport $\text{PT}_\gamma(v)$ along the geodesic $\gamma$:
-$$\|\text{PT}_\gamma(v)\|_p = \|v\|_p \quad \text{for ALL } p \in \gamma$$
-$$\langle \text{PT}_\gamma(v), \gamma(t) \rangle = 0 \quad \text{(stays tangent to the sphere)}$$
-→ **No accumulation**, guaranteed metric preservation.
-### Why RKL > Output-level KL?
-**Output-level KL:**
-$$\text{KL}(p_t(y|x) \| p_{t+1}(y|x))$$
-Problem: It can be minimized when $p_{t+1}$ merely "softens" predictions through temperature scaling, even though the features shift dramatically!
-**RKL via Fisher Information Metric:**
-$$D_{RKL}(\theta_t \| \theta_{t+1}) = \int \text{FIM}(\theta_t) \| \Delta\theta \|^2 d\theta$$
-If $D_{RKL} \approx 0$:
-- Decision boundaries stable
-- Features preserve their discriminative structure
-- Weight changes stay within the "safe region"
----
-## V. ALGORITHMIC DETAILS & IMPLEMENTATION
-### Training Algorithm (RTA-CL)
-**Input**: Current task data $D_t$, old learned distributions $\{P_c^{(t-1)}\}_{c \in C_{old}}$, network $f_\theta$
-**Output**: Updated parameters $\theta_t$, updated distributions $\{P_c^{(t)}\}$
-```
-Algorithm: Continual Learning with RTA
-for each task t = 1, 2, ..., T:
-  # Phase 1: Collect Feature Statistics
-  Z_c = []                    # Buffer per old class
-  for c in C_old:
-    Z_c = collect_features(D_test^c, f_{θ_{t-1}})  # M=1 exemplar per class
-    μ_c^{obs} ← mean(Z_c)
-  # Phase 2: Detect Drift & Compute Geodesics
-  geodesic_dist = []
-  for c in C_old:
-    θ_c ← arccos(μ_c^{old} · μ_c^{obs})     # geodesic angle
-    geodesic_dist.append(θ_c)
-  # Phase 3: Train on New Task
-  for epoch = 1 to num_epochs:
-    for batch (x, y) in D_t:
-      # Forward pass
-      z = encoder(x)                  # features on sphere
-      logits = classifier(z)
-      # Task loss
-      L_CE = CrossEntropy(logits, y)
-      # Mutual information (discriminativity)
-      L_MI = -InfoNCE(z, y)
-      # Geometry lock with drift correction
-      L_geo = 0
-      for c in C_old:
-        # Parallel transport correction
-        A_c^{aligned} = ParallelTransport(
-            A_c^{old},
-            μ_c^{old},
-            μ_c^{obs}
-        )
-        # Compute current covariance
-        A_c^{new} = compute_covariance(
-            features_c^{new}, method='Bingham_MLE'
-        )
-        # Wasserstein distance between old and new
-        L_geo += W_2(A_c^{aligned}, A_c^{new})
-      # Adaptive weighting
-      λ₁ = λ₁_init * (1 - epoch/num_epochs)^1.5
-      λ₂ = λ₂_init * (1 + epoch/num_epochs)^1.5
-      # Total loss
-      L_total = L_CE + λ₁*L_MI + λ₂*L_geo
-      # Backward
-      θ ← θ - α ∇L_total
-  # Phase 4: Update Distributions for Next Task
-  θ_{t} ← θ
-  for c in C_old ∪ C_new:
-    A_c^{(t)} ← compute_covariance(
-        collect_features(D_train^c, f_{θ_t}),
-        method='Bingham_MLE'
-    )
-    P_c^{(t)} = {A_c^{(t)}, variance_c^{(t)}}
-```
-### Computational Complexity Analysis
-| Operation | Complexity | Notes |
-|-----------|-----------|-------|
-| Bingham MLE (per class) | $O(d^3 + n_c d^2)$ | eigendecomposition dominates |
-| Parallel Transport | $O(d^2)$ | simple matrix-vector ops |
-| Wasserstein W_2 | $O(d^3)$ | one matrix sqrt call |
-| Drift detection (M=1) | $O(d)$ | just dot product |
-| Per-batch overhead | $O(d^2)$ | Computing A_c during training |
-**Total per task**:
-- Training: $O(N_{epochs} \times N_{batches} \times d^2)$ (manageable)
-- Evaluation: $O(|C_{old}| \times d^3)$ (one-time, after training)
-**Memory**: $O(L \times |C_{old}| \times d^2)$ for storing the covariance matrices (reasonable)
-### Hyperparameter Settings (Recommended)
-```
-λ₁_init = 0.1          # mutual information weight
-λ₂_init = 0.01         # RKL weight (start small)
-α_layer = 0.5          # per-layer RKL scaling
-τ = 0.05               # temperature for InfoNCE
-warmup_epochs = 5      # before applying geometry loss
-num_exemplars_M = 1    # per old class (memory efficient)
-```
----
-## VI. COMPARATIVE ANALYSIS & EXPECTED IMPACT
-### RTA vs. MINION v17 (Detailed)
-| Criterion | MINION v17 | RTA | Advantage |
-|-----------|-----------|-----|-----------|
-| **Distribution Model** | von Mises-Fisher (isotropic) | Bingham (anisotropic) | RTA captures task-specific anisotropy |
-| **Parameter Geometry** | Euclidean assumptions | Riemannian manifold | RTA preserves topology on $\mathbb{S}^{d-1}$ |
-| **Drift Correction** | Procrustes (linear rotation) | Parallel transport (geodesic path) | RTA avoids error accumulation |
-| **Knowledge Retention** | KL divergence on outputs | Riemannian KL + FIM-weighted | RTA locks feature topology |
-| **Adaptation** | Fixed ensemble weights | Dynamic per-layer scheduling | RTA adapts to feature drift rate |
-| **Drift Detection** | Implicit | Explicit geodesic distance | RTA quantifies drift magnitude |
-| **$M=1$ Reliability** | Low | Medium-High | RTA robust with one exemplar |
-| **Computational Cost** | O($d^2$) per layer | O($d^3$) per task | RTA justified for architecture $d < 2048$ |
-### Expected Benefits
-1. **Theoretical Soundness** ✅
-   - Formalized from Riemannian geometry + Information theory
-   - Metric preservation guaranteed (no accumulation error)
-   - FIM-weighted retention (principled trade-off)
-2. **Feature-Level Monitoring** ✅
-   - Explicit encoder drift detection (geodesic angle)
-   - Adapt weighting per layer based on drift rate
-   - Early warning: predict forgetting before it happens
-3. **Robustness with Few Exemplars** ✅
-   - Only M=1 exemplar per class required
-   - Used only for geodesic direction (not mean estimation)
-   - Stable covariance via Bingham MLE regularization
-4. **Anisotropy Learning** ✅
-   - Auto-discover task-specific dimensions
-   - Protect important features while allowing update in noise
-   - Implicit soft-attention to discriminative directions
-### Limitations & Mitigation
-1. **Computational Cost** ⚠️
-   - Eigendecomposition ($O(d^3)$) per task
-   - Practical for $d < 2048$, problematic for ViT ($d > 4096$)
-   - **Mitigation**: Low-rank Bingham approximation (top-k eigenvectors)
-2. **Small M Assumption** ⚠️
-   - M=1 not reliable if exemplar outlier
-   - **Mitigation**: Robust covariance (Huber-type)
-3. **Hyperparameter Tuning** ⚠️
-   - Multiple $\lambda$'s to tune
-   - **Mitigation**: Automatic scheduling via validation
-4. **Feature Normalization Requirement** ⚠️
-   - Assumes normalized embeddings
-   - **Mitigation**: Standard practice in modern architectures
----
-## VII. CONCLUSION & RECOMMENDATIONS
-### Summary: Why RTA is "Tighter" than MINION v17
-1. ✅ **Rigorous Mathematics**: Bingham + Riemannian geometry unified framework
-2. ✅ **Explicit Monitoring**: Track feature drift via geodesic distance
-3. ✅ **Metric Preservation**: Parallel Transport guarantees no accumulation error
-4. ✅ **Formal Retention**: RKL via Fisher Information Metric (not ad-hoc)
-5. ✅ **Adaptive Learning**: Per-layer + dynamic weighting based on real drift
-### Trade-offs
-- Higher computational cost (eigendecomposition per task)
-- More hyperparameters (automatic scheduling helps)
-- Requires normalized features (okay for modern architectures)
-### When to Use RTA
-**Use RTA if:**
-- ✅ Catastrophic forgetting is main bottleneck
-- ✅ Feature drift is large (domain shift / diverse tasks)
-- ✅ Can afford $O(d^3)$ computation per task
-- ✅ $d < 2048$ (typical CNN/small transformer)
-**Use simpler methods (EWC, LwF) if:**
-- ✅ Only incremental learning needed (similar domains)
-- ✅ Memory/compute severely limited
-- ✅ Model is large ($d > 4096$)
-**Hybrid approach:**
-- Apply RTA to early+middle layers (detect drift early)
-- Simple EWC regularization on final layer (cheap)
-- 70% of benefits, 40% of cost

human_working_IdeaMethod_and_discuss/new_idea_analysis.md DELETED Viewed

@@ -1,470 +0,0 @@
-# New Idea Analysis: Statistical Knowledge Signatures + OT Routing + Backbone Anti-Drift
-## Comprehensive Analysis Report
----
-# PART 1: OVERVIEW OF THE NEW IDEA
-## 1.1 Context & Motivation
-Observation: Papers at top 2025 conferences (NeurIPS, ICML, ICLR, ACL...) pay a great deal of attention to **knowledge isolation via submodule + routing**:
-- GainLoRA (NeurIPS'25): LoRA branches + gating
-- MINGLE (NeurIPS'25): MoE + Null-Space Gating
-- SMoLoRA (ICCV'25): Separable Mixture of LoRA
-- TreeLoRA (ICML'25): Hierarchical gradient-similarity tree
-- HiDe-LLaVA (ACL'25): Task-specific expansion + CKA fusion
-- MoE-Adapters (CVPR'24): Standard MoE routing
-- ... and many other papers
-→ Clear trend: **Submodule architecture + routing mechanism** is the dominant paradigm of 2025.
-## 1.2 Three Components of the New Idea
-### Component 1: Statistical Knowledge Signatures
-- Use strong statistical tools (vMF, Bingham, GMM...) to **summarize the knowledge space** of each module
-- Each module/expert has a "statistical signature" (fingerprint) describing the data distribution it has learned
-- Difference from gating networks: the signature has a clear statistical meaning rather than being learned weights
-### Component 2: Optimal Transport Routing
-- Use OT as a **principled routing mechanism**
-- Cost matrix based on the **distributional distance** between the input and the modules' signatures
-- Replaces softmax gating/top-k selection with OT matching
-### Component 3: Backbone Anti-Drift & Anti-Invasion
-- The shared backbone is protected by:
-  - A loss penalizing representation drift (old cluster centers must not drift too far)
-  - A loss penalizing invasion (new classes must not encroach on old-class regions)
-- Inherited from the earlier simple_idea, applied to a modular architecture
----
-# PART 2: NOVELTY ASSESSMENT
-## 2.1 Overall conclusion: **HIGH NOVELTY**
-No paper (among the 109 surveyed + ~30 additional papers) combines all 3 components. Each component on its own has prior work, but with a **different purpose and usage**.
-## 2.2 Cross-check against the 109 Surveyed Papers
-### Component 1: Statistical Signatures for Modules
-| Paper | What it does | Difference from the new idea |
-|-------|-----------|---------------------------|
-| **35. Feature Distributions** (ICML'25) | "Presentative feature distribution" used to select a PEFT block | Distribution = **mean vector only**, not a rich statistical model (vMF/Bingham). Used for block selection, not as a knowledge fingerprint |
-| **73. PromptCCD** (ECCV'24) | GMM for prompt pool routing | GMM = Gaussian, not geometric. Used for category discovery, not CL routing |
-| **96. FeCAM** (NeurIPS'23) | Class-specific covariance + Mahalanobis | Statistical modeling, but for **classification** (single model), not a module signature |
-| **65. CLAP4CLIP** (NeurIPS'24) | Probabilistic feature modeling | Gaussian distribution, CLIP-based, not a module fingerprint |
-**Conclusion for Component 1:** Paper 35 is the closest, but it uses only a mean-vector representation, not a rich statistical model. **No paper uses vMF/Bingham/directional distributions as a "knowledge signature" for a module.**
-### Component 2: OT-based Routing
-| Paper | Routing mechanism | Difference |
-|-------|------------------|-----------|
-| **01. GainLoRA** (NeurIPS'25) | Gating modules | Learned gating, not distributional |
-| **02. MINGLE** (NeurIPS'25) | Null-Space Constrained Gating | Algebraic constraint, not OT |
-| **09. MoDE** (NeurIPS'25) | Modality-based separation | By modality, not distributional |
-| **14. SMoLoRA** (ICCV'25) | Dual routing (visual + instruction) | Separable by function, not OT |
-| **21. PLAN** (ICCV'25) | Orthogonal basis allocation | Algebraic, not OT |
-| **23. ARM** (ACL'25) | Activation-guided routing | Activation-based, not distributional |
-| **27. HiDe-LLaVA** (ACL'25) | CKA similarity fusion | Similarity metric, not OT |
-| **41. TreeLoRA** (ICML'25) | Gradient-similarity tree | Gradient-based, not distributional |
-| **82. MoE-Adapters** (CVPR'24) | Standard MoE gating | Softmax gating, not OT |
-| **102. MRN** (ICCV'23) | Multiplexed routing | Language-specific paths, not OT |
-**Conclusion for Component 2: Among the 109 papers, NONE uses OT for routing in CL.** All of them use gating networks, activation-based routing, gradient similarity, or algebraic constraints.
-### Component 3: Backbone Anti-Drift in a Modular Architecture
-| Paper | Drift handling | Difference |
-|-------|---------------|-----------|
-| **77. LDC** (ECCV'24) | Learnable drift compensation | **Single model**, not a modular backbone |
-| **20. Dual Drift** (ICCV'25) | Prototype drift analysis | Single model, prototype-level |
-| **61. LoRA-** (CVPR'25) | Drift-Resistant Space | LoRA subtraction, not an anti-drift loss |
-| **47. Proxy-FDA** (ICML'25) | Feature distribution alignment | Single model + proxies |
-| **13. MG-CLIP** (ICCV'25) | Modality gap preservation | CLIP-specific, not a shared backbone |
-**Conclusion for Component 3: Drift compensation has been studied, but IN THE SINGLE-MODEL CONTEXT.** No paper applies an anti-drift + anti-invasion loss to the **backbone of a modular architecture.**
-## 2.3 Cross-check với Papers Bổ Sung (Ngoài 109)
-### OT trong MoE/Routing (không phải CL)
-| Paper | Chi tiết | Mối quan hệ |
-|-------|---------|-------------|
-| **BASE Layers** (ICML'21) | OT (linear assignment) for balanced expert allocation | OT is used for **load balancing**, NOT for distribution-matching routing. The cost matrix consists of learned scores, not distributional distances |
-| **Grassmannian MoE** (arXiv Feb'26) | Matrix Bingham distributions on the Grassmannian manifold for routing | **HIGHEST RISK**: uses Bingham for routing. HOWEVER: (a) it is NOT CL, (b) Bingham controls routing entropy (sparsity) and does NOT characterize knowledge |
-| **Selective Sinkhorn Routing** (Nov'25) | Sinkhorn-based routing for MoE | OT for load balancing, not knowledge matching |
-### Statistical Distributions in CL (not module signatures)
-| Paper | Details | Relationship |
-|-------|---------|-------------|
-| **vMF for Online CL** (AAAI'24) | vMF distribution for online CL | vMF is used as a **training loss** (concentration penalty), NOT as a module fingerprint |
-| **SCDEM** (Apr'25) | OT in a CL context | OT for **feature alignment**, not routing |
-### MoE + CL (not OT routing)
-| Paper | Details | Routing mechanism |
-|-------|---------|------------------|
-| **CaRE** (arXiv Feb'26) | Continual Learning with Routing among Experts | Learned routing, not OT |
-| **PASs-MoE** (arXiv Jan'26) | Parameter-Adaptive Sparse MoE | Adaptive sparsity, not OT |
-| **TRGE** (arXiv Aug'25) | Task-Regularized Gradient Experts | Gradient-based expert selection |
-## 2.4 Novelty Risk Analysis
-### HIGH risk: Grassmannian MoE (arXiv:2602.17798)
-- **Overlap:** Uses a Bingham distribution + manifold geometry for routing
-- **Key differences:**
-  1. NOT CL; it is only an MoE for language modeling
-  2. Bingham controls **routing entropy** (sparsity vs utilization tradeoff)
-  3. Does NOT characterize an expert's "knowledge"; it only controls the gating weight distribution
-  4. Has NO anti-drift/anti-invasion component
-- **Conclusion:** Can be cited as related work, but its purpose is entirely different
-### MEDIUM risk: Paper 35 (Feature Distributions, ICML'25)
-- **Overlap:** Uses a "feature distribution" to select modules
-- **Difference:** Distribution = mean vector, not a rich statistical model. Used for PEFT block selection, not principled routing
-- **Conclusion:** The new idea can be positioned as a generalization/upgrade
-### LOW risk: the remaining papers
-- BASE Layers, SERS, FeCAM: each touches only one component, at a surface level
-## 2.5 Four Confirmed Novelty Gaps
-| # | Novelty Gap | What no paper has done yet |
-|---|-------------|----------------------|
-| 1 | **Rich statistical signatures** | Uses vMF/Bingham/directional distributions as fingerprints of an expert's knowledge space |
-| 2 | **OT with distributional-distance cost** | OT routing based on a distributional distance (KL, Wasserstein) between the input and module signatures |
-| 3 | **Three-component integration** | Combines statistical signatures + OT routing + backbone protection in one framework |
-| 4 | **Anti-drift/invasion in a modular backbone** | Applies a center drift penalty + invasion loss to the shared backbone of a modular architecture |
----
-# PART 3: SOUNDNESS ANALYSIS
-## 3.1 Component 1: Statistical Knowledge Signatures
-### Sound ✅
-- **Theoretical basis:** The feature spaces of modern encoders (BERT, ViT) typically lie on structured manifolds (a hypersphere for normalized features, a cone for ReLU features). A distribution matched to that geometry (vMF for the hypersphere, Bingham for elliptical structure) captures more information than a mean vector.
-- **Advantage over a gating network:** A signature is interpretable (concentration, direction, and spread can be measured), whereas gating weights are a black box.
-- **Evidence from the literature:**
-  - FeCAM (96): Shows that class-specific covariance (a statistical tool) outperforms a mean-only prototype
-  - CLAP4CLIP (65): Probabilistic modeling > deterministic features
-  - Angle Matters (48): Angle/direction in feature space determines forgetting → a distribution captures directional information
-### Caveats ⚠️
-- **Incremental updates:** When a new task arrives, the signature must be updated. vMF has sufficient statistics (mean direction + concentration) → it can be updated online. GMM is more complex.
-- **Storage cost:** Each module must store its signature parameters. vMF: O(d+1) per module (mean direction vector + κ). Bingham: O(d²) per module. With small d (after projection) → acceptable.
-- **Recommendation:** Start with vMF (the simplest, and a good fit for hypersphere features) → extend to Bingham/GMM if needed.
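The online vMF update mentioned above can be sketched as follows. This is only an illustration under stated assumptions: features are normalised to unit length before accumulation, and κ is estimated with the widely used closed-form approximation κ ≈ r̄(d − r̄²)/(1 − r̄²). The class name `VMFSignature` and its methods are hypothetical and do not come from any of the surveyed papers.

```python
import math


class VMFSignature:
    """Running vMF summary of one module's features (hypothetical helper)."""

    def __init__(self, dim):
        self.dim = dim
        self.resultant = [0.0] * dim  # sum of unit-normalised feature vectors
        self.count = 0                # number of vectors accumulated so far

    def update(self, features):
        """Fold a batch of feature vectors into the sufficient statistics."""
        for vec in features:
            norm = math.sqrt(sum(v * v for v in vec))
            if norm == 0.0:
                continue  # a zero vector carries no direction
            for i, v in enumerate(vec):
                self.resultant[i] += v / norm
            self.count += 1

    def mean_direction(self):
        """Unit vector pointing along the accumulated resultant."""
        length = math.sqrt(sum(r * r for r in self.resultant))
        return [r / length for r in self.resultant]

    def concentration(self):
        """Approximate kappa from the mean resultant length (assumed estimator)."""
        length = math.sqrt(sum(r * r for r in self.resultant))
        r_bar = min(length / self.count, 1.0 - 1e-6)  # guard against division by zero
        return r_bar * (self.dim - r_bar ** 2) / (1.0 - r_bar ** 2)
```

Only the resultant vector and a count are stored, so the per-module footprint stays at O(d+1), in line with the storage estimate above.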
-## 3.2 Component 2: OT-based Routing
-### Sound ✅
-- **Theoretical basis:** OT provides an optimal matching between two distributions, making it a natural framework for "matching input to expert". The Sinkhorn algorithm allows a differentiable approximation.
-- **Advantages over softmax gating:**
-  - **Principled:** Optimizes a global assignment instead of local gating scores
-  - **Load-balanced by design:** The OT constraints naturally balance load (as demonstrated in BASE Layers)
-  - **Distribution-aware:** The cost matrix encodes distributional distances, not raw scores
-- **Feasibility:** Sinkhorn iterations: O(n²·k) with n tokens and k experts. With small k (CL typically has 5-20 experts) → tractable.
-### Caveats ⚠️
-- **Inference latency:** Sinkhorn is iterative → slower than simple softmax gating. Mitigation: few iterations (5-10), or amortized inference.
-- **Cost matrix construction:** The distance between an input sample/batch and a module signature must be defined. Options: vMF log-likelihood, Wasserstein distance, KL divergence.
-- **Recommendation:** Use Sinkhorn with a large regularization ε (fast convergence) + the negative vMF log-likelihood as the cost.
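A rough sketch of entropic OT routing with a handful of Sinkhorn iterations is given here for illustration. It assumes uniform marginals over inputs and experts and takes an arbitrary precomputed cost matrix; the function names `sinkhorn_plan` and `route`, and the default values for `epsilon` and `n_iters`, are illustrative choices rather than settings from the report.

```python
import math


def sinkhorn_plan(cost, epsilon=0.5, n_iters=10):
    """Entropic OT plan between n inputs and k experts with uniform marginals.

    cost[i][j] is the cost of sending input i to expert j (lower is better).
    """
    n, k = len(cost), len(cost[0])
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    row_mass, col_mass = 1.0 / n, 1.0 / k
    u = [1.0] * n
    v = [1.0] * k
    for _ in range(n_iters):
        # Alternate scaling so rows, then columns, match their target mass.
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]


def route(plan):
    """Pick, for each input, the expert that receives most of its mass."""
    return [max(range(len(row)), key=row.__getitem__) for row in plan]
```

In the proposed framework the cost entries would come from a distance between an input and each module signature (for example a negative vMF log-likelihood); here any non-negative matrix works, which keeps the sketch independent of that choice.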
-## 3.3 Component 3: Backbone Anti-Drift
-### Sound ✅
-- **Theoretical basis:** The shared backbone in a modular architecture is still updated → representation drift. No current paper addresses this explicitly.
-- **Evidence:**
-  - LDC (77): Shows that drift compensation improves performance
-  - Dual Drift (20): Both inner-task and inter-task prototype drift cause forgetting
-  - LoRA- (61): The drift-resistant space concept validates the need
-- **Natural fit for a modular architecture:** The backbone is shared by all modules → drift affects ALL old tasks at once. An anti-drift loss at the backbone level → protects everything.
-### Caveats ⚠️
-- **Plasticity-stability balance:** An anti-drift loss that is too strong → the backbone cannot learn new features. Adaptive weighting is needed.
-- **Anti-invasion definition:** In a modular architecture, the "old-class region" is defined via module signatures → a natural link to Component 1.
-- **Recommendation:** Use EMA-based center tracking + dynamic λ scheduling (from the RTA framework in method.md).
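For illustration, EMA-based center tracking with a squared-distance drift penalty might look like the sketch below. The helper names, the momentum value, and the squared Euclidean form of the penalty are all assumptions; the λ schedule from method.md is not reproduced, so `lam` is passed in as a plain number.

```python
def ema_update(center, batch_mean, momentum=0.9):
    """Move a tracked feature center toward the current batch mean."""
    return [momentum * c + (1.0 - momentum) * m for c, m in zip(center, batch_mean)]


def drift_penalty(current_center, reference_center):
    """Squared Euclidean distance between current and stored centers."""
    return sum((c - r) ** 2 for c, r in zip(current_center, reference_center))


def anti_drift_loss(task_loss, current_center, reference_center, lam):
    """Task loss plus a weighted drift penalty; lam trades plasticity for stability."""
    return task_loss + lam * drift_penalty(current_center, reference_center)
```

A larger `lam` pins the backbone centers more tightly to their stored values, which is the plasticity-stability tradeoff flagged in the caveats above.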
-## 3.4 Internal Consistency
-| Aspect | Assessment | Explanation |
-|--------|-----------|------------|
-| Component 1 ↔ 2 | ✅ Consistent | Signatures (C1) supply the distributions for the OT cost matrix (C2). They are designed to work together. |
-| Component 2 ↔ 3 | ✅ Consistent | OT routing (C2) assigns inputs → modules. Anti-drift (C3) protects the shared backbone. The two mechanisms are orthogonal and do not conflict. |
-| Component 1 ↔ 3 | ✅ Synergistic | Signatures (C1) also detect drift: if the backbone drifts → the feature distribution changes → signatures become outdated → a signal to trigger anti-drift. |
-## 3.5 Overall Soundness Assessment
-**The idea is sound at the idea level.** The three components have a solid theoretical basis, are internally compatible, and address a real gap in the literature. The potential contribution is strong if the implementation is done correctly.
-**Biggest risk:** The computational overhead (OT + distribution estimation + anti-drift) may be significant. Careful engineering is needed.
----
-# PART 4: SURVEY OF 2025 PAPERS: MOTIVATION FOR APPLYING THE NEW IDEA
-## 4.1 New Evaluation Criteria (for the New Idea)
-| Criterion | Description | Weight |
-|----------|--------|----------|
-| **M1. Submodule architecture** | The paper uses multi-module/expert/LoRA → the new idea fits | ★★★ |
-| **M2. Upgradable routing** | Current routing is simple (gating, top-k) → OT routing can improve it | ★★★ |
-| **M3. Backbone drift problem** | The paper has a shared backbone that drifts → anti-drift loss applicable | ★★ |
-| **M4. Suitable domain** | ML/NLP preferred, CV lower | ★★ |
-| **M5. Reproducibility** | Code available, clear benchmark | ★ |
-Note: The assessment is at the **outline** level: it checks whether a paper offers the motivation/feasibility to apply the idea, NOT the details of specific tools (e.g., whether vMF fits).
-## 4.2 2025 Papers with High Motivation (Score ≥ 7/10)
-### 🥇 Paper 01 | GainLoRA | NeurIPS'25 | NLP
-**Motivation Score: 9/10**
-- ✅ M1: LoRA branches per task + gating modules — multi-module architecture
-- ✅ M2: Gating = simple learned module → OT routing can replace it with a more principled assignment
-- ✅ M3: Shared base model is updated → backbone drift likely
-- ✅ M4: NLP (LLM continual learning)
-- **Why apply:** GainLoRA uses simple gating to integrate LoRA branches. Replace the gating with (1) a statistical signature for each LoRA branch + (2) OT routing that matches the input distribution → principled expert selection. The anti-drift loss protects the base LLM.
-### 🥇 Paper 02 | MINGLE | NeurIPS'25 | ML
-**Motivation Score: 9/10**
-- ✅ M1: MoE + low-rank experts + gating
-- ✅ M2: Null-Space Constrained Gating: algebraic; does not capture the knowledge distribution
-- ✅ M3: Test-time merging implies shared components
-- ✅ M4: ML/Multi
-- **Why apply:** MINGLE uses null-space projection for gating. Statistical signatures would capture the knowledge space more richly than a null-space constraint. OT routing provides a globally optimal assignment instead of local gating.
-### 🥇 Paper 41 | TreeLoRA | ICML'25 | ML
-**Motivation Score: 9/10**
-- ✅ M1: Layer-wise LoRA allocation via hierarchical tree
-- ✅ M2: Gradient similarity → heuristic; does not capture the full knowledge distribution
-- ✅ M3: Shared pretrained model as backbone
-- ✅ M4: ML (both ViTs + LLMs)
-- **Why apply:** TreeLoRA uses gradient similarity to allocate LoRA. Gradient similarity is a proxy for task similarity but does not capture the full distribution. A statistical signature for each LoRA node in the tree → richer characterization. OT routing replaces the multi-armed bandit.
-### 🥈 Paper 14 | SMoLoRA | ICCV'25 | ML/Multi
-**Motivation Score: 8/10**
-- ✅ M1: Separable Mixture of LoRA + dual routing
-- ✅ M2: Dual routing (visual + instruction) → can be upgraded to OT matching
-- ⚠️ M3: Shared backbone (VL model)
-- ✅ M4: VL (multimodal, but the IT setting is widely used)
-- **Why apply:** SMoLoRA uses separable routing for 2 modalities. OT routing can unify the dual routing into a single cost matrix, with signatures capturing both visual + instruction knowledge.
-### 🥈 Paper 35 | Feature Distributions | ICML'25 | NLP
-**Motivation Score: 8/10**
-- ✅ M1: Multi-PEFT-block (expanding/reusing)
-- ✅ M2: "Presentative feature distribution" for block selection: DIRECTLY related, but uses a mean vector, not rich statistics
-- ⚠️ M3: Pre-trained LLM backbone
-- ✅ M4: NLP (LLM continual learning)
-- **Why apply:** The paper ALREADY uses the "feature distribution" idea to select blocks → **this is the best starting point** for the new idea. Upgrade: replace the mean vector with a vMF signature + replace selection with OT routing. The paper has already validated that distribution-based selection works.
-### 🥈 Paper 82 | MoE-Adapters | CVPR'24 | ML/Multi
-**Motivation Score: 8/10**
-- ✅ M1: MoE adapter architecture
-- ✅ M2: Standard MoE gating → classic candidate for an OT routing upgrade
-- ⚠️ M3: VLM backbone
-- ⚠️ M4: VL (CV-leaning)
-- **Why apply:** Standard MoE gating is the simplest routing and the easiest to upgrade to OT. Code is available (github.com/JiazuoYu/MoE-Adapters4CL).
-### 🥈 Paper 27 | HiDe-LLaVA | ACL'25 | NLP
-**Motivation Score: 8/10**
-- ✅ M1: Task-specific expansion + task-general fusion
-- ✅ M2: CKA similarity guides layer-wise handling → distribution signatures provide richer similarity
-- ✅ M3: Shared LLaVA backbone
-- ✅ M4: NLP (instruction tuning)
-- **Why apply:** HiDe-LLaVA uses CKA similarity → a scalar measure. A distribution signature captures richer information (direction, spread, concentration). OT routing replaces CKA-based fusion.
-### 🥈 Paper 23 | ARM | ACL'25 | ML
-**Motivation Score: 8/10**
-- ✅ M1: MoE (Knowledge Experts) + routing
-- ✅ M2: Activation-guided routing → doesn't capture knowledge distribution
-- ⚠️ M3: LLM backbone
-- ✅ M4: NLP (knowledge editing, but the MoE architecture is common)
-- **Why apply:** ARM uses activation-guided routing (a heuristic). Statistical signatures + OT routing provide a principled alternative.
-## 4.3 2025 Papers with Medium Motivation (Score 5-7/10)
-### Paper 09 | MoDE | NeurIPS'25 | ML/Multi
-**Motivation Score: 7/10**
-- ✅ M1: Modality-specific experts
-- ⚠️ M2: Expert isolation by modality (not really routing) → OT routing less applicable
-- ✅ M3: Unified model backbone
-- **Reason:** Routing is by modality → fixed, so OT is not needed. But anti-drift for the backbone is useful.
-### Paper 21 | PLAN | ICCV'25 | ML
-**Motivation Score: 7/10**
-- ✅ M1: Orthogonal basis vectors per task
-- ⚠️ M2: Orthogonal allocation ≠ routing (pre-determined), but distribution signatures can guide allocation
-- ✅ M3: Shared backbone
-- **Reason:** PLAN allocates in advance and does not route at inference. But signatures can guide better allocation.
-### Paper 08 | CaLoRA | NeurIPS'25 | ML
-**Motivation Score: 6/10**
-- ✅ M1: LoRA branches + causal analysis
-- ⚠️ M2: Gradient projection based on task correlation — already somewhat distributional
-- ⚠️ M3: LoRA-level, not backbone
-- **Reason:** CaLoRA already uses causal attribution → more sophisticated than simple gating. OT routing can still improve it, but the gap is smaller.
-### Paper 18 | Instruction-Grounded VP | ICCV'25 | ML/Multi
-**Motivation Score: 6/10**
-- ✅ M1: Mixture of visual projectors
-- ⚠️ M2: Expert recommendation + pruning → OT could improve recommendation
-- ⚠️ M3: VLM backbone shared
-- **Reason:** Projector-level MoE. OT routing is applicable but projector-specific.
-### Paper 17 | TWIST&SCOUT | ICCV'25 | NLP
-**Motivation Score: 5/10**
-- ✅ M1: Twin experts (frozen + learnable)
-- ❌ M2: No routing mechanism (fixed twin structure); hard to apply OT
-- ✅ M3: Shared model backbone
-- **Reason:** The twin expert structure is fixed → there is no routing to upgrade. Only Component 3 (anti-drift) is applicable.
-### Paper 44 | SEFE | ICML'25 | ML/Multi
-**Motivation Score: 6/10**
-- ✅ M1: RegLoRA (regularized LoRA) — multi-module
-- ⚠️ M2: Regularization-based, not routing
-- ⚠️ M3: Shared backbone
-- **Reason:** SEFE categorizes forgetting (superficial vs essential). Signatures could detect which type of forgetting occurs.
-### Paper 61 | LoRA- | CVPR'25 | ML
-**Motivation Score: 6/10**
-- ⚠️ M1: LoRA subtraction (not standard MoE routing)
-- ⚠️ M2: Drift-Resistant Space = alternative approach; OT routing is not directly applicable
-- ✅ M3: Drift is the central problem → directly relevant to Component 3
-- **Reason:** The DRS concept and Component 3 (anti-drift) are complementary. Signatures + DRS could be combined.
-### Paper 77 | LDC | ECCV'24 | ML
-**Motivation Score: 6/10**
-- ❌ M1: Single model + lightweight drift module
-- ❌ M2: No routing
-- ✅ M3: Drift compensation → directly relevant to Component 3
-- **Reason:** The LDC concept is directly related to Component 3, but it is single-model → needs adapting to the modular setting.
-## 4.4 Papers WITHOUT Motivation (Score < 5)
-Groups of papers NOT suitable for applying the idea:
-- **Knowledge Editing papers** (03, 10, 12, 22, 25, 36, 37, 38, 42, 50): Fact-level editing, not representation-level CL
-- **Benchmark/Analysis papers** (34, 37, 48, 52, 90): No model to apply the idea to
-- **Training-free/Data-level papers** (24, 28, 32, 55, 58, 89): No modular architecture
-- **Prompt-based papers** (46, 56, 68, 87, 100, 105, 109): Prompt pool ≠ modular experts
-- **Single-model non-geometric** (04, 11, 16, 40, 79, 95, 97, 104): No submodule + routing
----
-# PART 5: FILTERING PAPERS FEASIBLE ON T4/P100 (16GB VRAM)
-## 5.1 GPU Feasibility Criteria
-| Factor | T4/P100 Compatible | Needs > 16GB |
-|--------|-------------------|------------|
-| ViT-B/ViT-L + LoRA | ✅ | |
-| CLIP ViT-B + adapters | ✅ | |
-| BERT/RoBERTa | ✅ | |
-| LLaMA-7B + LoRA (QLoRA 4-bit) | ✅ (borderline) | |
-| LLaMA-7B full fine-tune | | ❌ |
-| LLaMA-13B+ | | ❌ |
-| LLaVA-7B + LoRA | ✅ (tight) | |
-| LLaVA-13B+ | | ❌ |
-| Diffusion models (SD) | ⚠️ depends | |
-## 5.2 Feasibility Table: Papers with High Motivation
-| Rank | Paper | Motivation | GPU Feasible | Base Model | Code | Overall assessment |
-|------|-------|-----------|-------------|------------|------|---------------|
-| ⭐1 | **35. Feature Distributions** | 8/10 | ✅ Likely (PEFT on LLM, small modules) | LLM + PEFT blocks | ❌ | **TOP PICK NLP** — closest to idea, PEFT = low VRAM |
-| ⭐2 | **82. MoE-Adapters** | 8/10 | ✅ (CLIP ViT-B/L + adapters) | CLIP ViT | ✅ github | **TOP PICK ML**: standard MoE, clear upgrade path, code available |
-| ⭐3 | **41. TreeLoRA** | 9/10 | ✅ (ViT) / ⚠️ (LLM, depends on size) | ViT + LLM | ❌ | **TOP PICK ML** — tree structure natural for signatures |
-| ⭐4 | **01. GainLoRA** | 9/10 | ⚠️ Depends on LLM size (7B QLoRA OK) | LLM + LoRA | ❌ | **TOP PICK NLP** if LLM ≤ 7B |
-| 5 | **02. MINGLE** | 9/10 | ⚠️ Test-time merging may need multiple models loaded | MoE experts | ❌ | Complex, but high motivation |
-| 6 | **14. SMoLoRA** | 8/10 | ⚠️ (LLaVA-7B + LoRAs → tight) | LLaVA + LoRA | ✅ github | VL, code available, tight memory |
-| 7 | **27. HiDe-LLaVA** | 8/10 | ⚠️ (LLaVA + expansion → tight/infeasible) | LLaVA + expansion | ❌ | Architecture growth → memory grows |
-| 8 | **23. ARM** | 8/10 | ⚠️ Depends on LLM base | LLM + MoE | ❌ | KE domain, complex |
-| 9 | **09. MoDE** | 7/10 | ⚠️ MM model size varies | Unified MM model | ❌ | Multimodal, not pure routing |
-| 10 | **21. PLAN** | 7/10 | ✅ (LoRA-based, small modules) | Pre-trained + LoRA | ❌ | Allocation, not routing |
-## 5.3 Top Recommendations: Prioritizing ML/NLP + T4/P100 Feasible
-### 🏆 Recommendation #1: Paper 35 — Feature Distributions (ICML'25)
-- **Domain:** NLP (LLM Continual Learning)
-- **Why:** This paper ALREADY uses the "feature distribution" concept for module selection → the **closest prior work** and the **best basis for demonstrating the upgrade**. Replacing the mean vector with a vMF signature + replacing the selection heuristic with OT routing → a clear, publishable contribution.
-- **GPU:** PEFT blocks = lightweight, likely feasible on T4
-- **Risk:** No public code → must be reimplemented
-### 🏆 Recommendation #2: Paper 82 — MoE-Adapters (CVPR'24)
-- **Domain:** ML/Multi (VLM Continual Learning)
-- **Why:** Standard MoE gating → **easiest upgrade path** to OT routing. Well-established benchmark. Public code available (github). CLIP-based → T4 feasible.
-- **GPU:** ✅ CLIP ViT-B + adapters fit T4 easily
-- **Risk:** VL domain (not pure NLP), but the methodology is general
-### 🏆 Recommendation #3: Paper 41 — TreeLoRA (ICML'25)
-- **Domain:** ML (ViTs + LLMs)
-- **Why:** The hierarchical structure suits statistical signatures well (one signature at each tree node). Gradient similarity → natural upgrade to distribution-based similarity. ICML'25 = strong baseline.
-- **GPU:** ✅ for ViT experiments. ⚠️ for LLMs, depending on size.
-- **Risk:** No code; more complex (tree structure + bandit)
-### 🏆 Recommendation #4: Paper 01 — GainLoRA (NeurIPS'25)
-- **Domain:** NLP (LLM Continual Learning)
-- **Why:** LoRA branches + gating = classic substrate for an OT routing upgrade. NeurIPS'25 = top venue. LLM CL = hot topic.
-- **GPU:** ⚠️ If base model ≤ 7B + QLoRA → feasible. If > 13B → not feasible.
-- **Risk:** No code; LLM base model size uncertain
-### 🏆 Recommendation #5: Paper 14 — SMoLoRA (ICCV'25)
-- **Domain:** ML/Multi (VL Instruction Tuning)
-- **Why:** Dual-routing concept → OT can unify it. Code available (github). ICCV'25.
-- **GPU:** ⚠️ LLaVA-7B + multiple LoRAs → tight on T4 but possibly feasible with optimization.
-- **Risk:** VL domain, memory tight
-## 5.4 Priority Summary Table
-| Priority | Paper | Domain | Motivation | GPU | Code | Action |
-|----------|-------|--------|-----------|-----|------|--------|
-| **1st** | 35 Feature Dist | NLP | 8 | ✅ | ❌ | Reimplement + upgrade distribution + OT |
-| **2nd** | 82 MoE-Adapters | ML | 8 | ✅ | ✅ | Direct upgrade gating → OT routing |
-| **3rd** | 41 TreeLoRA | ML | 9 | ✅/⚠️ | ❌ | Upgrade gradient-similarity → distribution signatures |
-| **4th** | 01 GainLoRA | NLP | 9 | ⚠️ | ❌ | If LLM ≤ 7B, upgrade gating → OT |
-| **5th** | 14 SMoLoRA | ML/VL | 8 | ⚠️ | ✅ | Unify dual routing → OT, code available |
----
-# PART 6: SUMMARY & RECOMMENDATIONS
-## 6.1 Assessment Summary
-| Dimension | Assessment | Details |
-|-----------|-----------|----------|
-| **Novelty** | 🟢 **HIGH** | 4 novelty gaps confirmed. Grassmannian MoE is the highest risk but serves a different purpose |
-| **Soundness** | 🟢 **SOUND** | The 3 components have a theoretical basis and are internally consistent and synergistic |
-| **Motivation for 2025** | 🟢 **STRONG** | 8+ papers have an architecture suited to applying the idea. The submodule+routing trend supports it |
-| **T4/P100 Feasibility** | 🟡 **CONDITIONALLY FEASIBLE** | 3-5 papers feasible (PEFT/CLIP-based). LLMs >7B need QLoRA or a smaller model |
-## 6.2 Proposed Strategy
-### Phase 1: Proof-of-concept (1-2 months)
-- **Target:** Paper 82 (MoE-Adapters): code available, T4 feasible, clear upgrade path
-- **Goal:** Implement statistical signatures (vMF) + OT routing to replace standard gating
-- **Validation:** Compare against the baseline MoE gating on the same benchmarks
-### Phase 2: Main contribution (2-3 months)
-- **Target:** Paper 35 (Feature Distributions) or Paper 01 (GainLoRA)
-- **Goal:** Full framework with 3 components (signatures + OT + anti-drift)
-- **Contribution:** Demonstrate superior performance through principled routing + backbone protection
-### Phase 3: Paper writing
-- **Position:** "From Gating to Matching: Statistical Knowledge Signatures with Optimal Transport Routing for Continual Learning"
-- **Claim:** Principled routing via distribution matching outperforms heuristic gating in modular CL
-## 6.3 Risks & Mitigation
-| Risk | Level | Mitigation |
-|------|-------|-----------|
-| Grassmannian MoE moves into CL | Medium | Differentiate: knowledge characterization vs routing entropy control |
-| OT inference overhead | Medium | Sinkhorn with few iterations + ε-regularization |
-| Lack of code for most targets | Medium | Start with Paper 82 (code available) |
-| vMF not suitable for all feature spaces | Low | Test multiple distributions; fallback to GMM |
-| Combined overhead too high for T4 | Medium | Start with small-scale experiments (ViT-B) |
----
-*Generated: Analysis of new_idea_modifier.txt against 109 surveyed papers + ~30 additional papers*
-*Focus: Novelty, Soundness, Motivation for 2025 papers, T4/P100 Feasibility*

human_working_IdeaMethod_and_discuss/new_idea_modifier.txt DELETED Viewed

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:e4be554e74456bb1fe8accf18e67ac0ff04cb9cd053ba4795fa8f9edaa14f1ca
-size 647

human_working_IdeaMethod_and_discuss/novelty_search_report.md DELETED Viewed

@@ -1,168 +0,0 @@
-# Comprehensive Novelty Search Report
-## Proposed Idea: Statistical Knowledge Signatures + OT Routing + Backbone Anti-Drift for Continual Learning
-**Date**: March 6, 2026
-**Search Scope**: arXiv (multi-query), specific paper fetches, workspace context analysis
----
-## I. EXISTING WORK: Papers That Partially Overlap
-### A. OT-Based Routing in MoE (Component 2 overlap)
-| # | Paper | Year | Venue | Relevance |
-|---|-------|------|-------|-----------|
-| 1 | **BASE Layers: Simplifying Training of Large, Sparse Models** (arXiv:2103.16716) | 2021 | ICML | Formulates token-to-expert assignment as a **linear assignment problem** (a special case of OT). Guarantees balanced compute loads without auxiliary losses. |
-| 2 | **Selective Sinkhorn Routing for Improved Sparse MoE** (arXiv:2511.08972) | 2025 | - | Formulates token-to-expert assignment as an **optimal transport problem** using Sinkhorn algorithm. Derives gating scores directly from transport map. **Most directly relevant to Component 2.** |
-| 3 | **Sparsity-Constrained Optimal Transport** (arXiv:2209.15466) | 2023 | ICLR | Theoretical OT framework with sparsity constraints applicable to MoE routing. |
-| 4 | **Continual Pre-training of MoEs: How robust is your router?** (arXiv:2503.05029) | 2025 | - | Studies Sinkhorn-balanced routing during continual pre-training. Shows surprising robustness of OT-based routing to distribution shift in CL settings. |
-**Key Difference from Proposed Idea**: These works use OT for **load-balancing** (assigning tokens to experts evenly). The proposed idea uses OT to **match input distributions to expert knowledge signatures** — a fundamentally different formulation where the cost matrix is derived from statistical distribution distances (e.g., vMF-to-vMF), not learned linear projections.
-### B. MoE + Routing for Continual Learning (Components 1+2 overlap)
-| # | Paper | Year | Venue | Relevance |
-|---|-------|------|-------|-----------|
-| 5 | **Scaling CL with Bi-Level Routing MoE (CaRE)** (arXiv:2602.03473) | 2026 | - | Bi-level routing: first selects task-specific routers, then routes to experts. Scales to 300+ tasks. Uses learned routers, not distribution matching. |
-| 6 | **PASs-MoE: Mitigating Misaligned Co-drift among Router and Experts** (arXiv:2601.13020) | 2026 | - | Identifies "misaligned co-drift" between router & experts in CL. Uses LoRA pathway activation subspaces for routing. Addresses router drift but not via OT or statistical signatures. |
-| 7 | **Separation and Collaboration: Two-Level Routing Grouped MoE for MDCL** (arXiv:2508.07738) | 2025 | - | Two-level routing (inter-group via task prototypes, intra-group via learned router). Uses task prototype distance for routing — conceptually related to "matching to knowledge signatures" but prototypes are simple mean vectors, not rich statistical distributions. |
-| 8 | **SCDEM: Self-Controlled Dynamic Expansion Model for CL** (arXiv:2504.10561) | 2025 | - | Multi-backbone + dynamic expert expansion. Uses **OT distance** for Feature Distribution Consistency (FDC) to align old/new representations. **Closest overlap: uses OT in CL with expert expansion, but OT is for feature alignment, NOT routing.** |
-| 9 | **Boosting CL of VLMs via MoE Adapters** (arXiv:2403.11549) | 2024 | CVPR | MoE adapters for continual VLM learning with routing. Standard softmax gating. |
-| 10 | **SAME: Stabilized MoE for Multimodal Continual Instruction Tuning** (arXiv:2602.01990) | 2026 | - | MoE for continual instruction tuning. Focuses on stabilization strategies. |
-| 11 | **Dynamic MoE of Curriculum LoRA Experts for Continual Multimodal IT** (arXiv:2506.11672) | 2025 | ICML | Dynamic architecture expansion under budget. Curriculum-based expert management. |
-| 12 | **MoTE: Mixture of Task-specific Experts for PTM-Based CIL** (arXiv:2506.11038) | 2025 | KBS | Task-specific experts with pre-trained model. Standard routing mechanisms. |
-### C. Statistical Distributions in Continual Learning (Component 1 overlap)
-| # | Paper | Year | Venue | Relevance |
-|---|-------|------|-------|-----------|
-| 13 | **vMF/Angular Gaussian for Online CL** (arXiv:2306.03364) | 2024 | AAAI | Uses vMF and Angular Gaussian distributions for **representation learning** in online CL. Pushes representations toward fixed prior directions on hypersphere. **Directly relevant to Component 1** — but uses vMF as a loss function, NOT as a routing signature for expert modules. |
-| 14 | **Interactive CL: Fast and Slow Thinking** (arXiv:2403.02628) | 2024 | CVPR | vMF-related distributions in CL context for cognitive-inspired learning. |
-| 15 | **General Incremental Learning with Domain-aware Categorical Representations** (arXiv:2204.04078) | 2022 | CVPR | Domain-aware representations for incremental learning using distributional methods. |
-### D. Backbone Feature Drift Compensation (Component 3 overlap)
-| # | Paper | Year | Venue | Relevance |
-|---|-------|------|-------|-----------|
-| 16 | **Exemplar-free CL via Learnable Drift Compensation (LDC)** (arXiv:2407.08536) | 2024 | ECCV | Learns a drift compensation module to correct for feature drift in backbones. **Directly relevant to Component 3** but uses a learned correction, not a penalty loss. |
-| 17 | **Exemplar-free CL of ViTs via Gated Class-Attention and Cascaded Feature Drift Compensation** (arXiv:2211.12292) | 2023 | - | Gated class-attention to minimize transformer drift + cascaded feature drift compensation. Relevant to anti-drift but uses gating/masking, not OT or invasion penalty. |
-| 18 | **Scalable Analytic Classifiers with Associative Drift Compensation for CIL** (arXiv:2602.00144) | 2026 | - | Analytic classifiers with drift compensation for ViTs. Uses Gaussian Discriminant Analysis. |
-| 19 | **Feature Drift Compensation Projection for Data-free Replay Continual Face Forgery Detection** (arXiv:2508.03189) | 2025 | - | Feature drift compensation projection for continual face forgery detection. |
-| 20 | **Resurrecting Old Classes with New Data for Exemplar-Free CL** (arXiv:2405.19074) | 2024 | CVPR | Addresses drift compensation without exemplars. |
-### E. Optimal Transport in Continual Learning (General)
-| # | Paper | Year | Venue | Relevance |
-|---|-------|------|-------|-----------|
-| 21 | **Merging without Forgetting: Continual Fusion via OT** (arXiv:2511.19561) | 2025 | - | Uses OT for **model merging** in CL (aligning task-specific model weights). OT used for weight-space alignment, NOT input routing. |
-| 22 | **LwI (workspace existing work)** | - | - | Uses OT (Sinkhorn) for **neuron alignment** between old and new models during continual learning. OT for model merging/alignment, not routing. |
-### F. Geometric/Statistical Routing (Component 1+2 joint overlap)
-| # | Paper | Year | Venue | Relevance |
-|---|-------|------|-------|-----------|
-| 23 | **Grassmannian MoE: Concentration-Controlled Routing on Subspace Manifolds** (arXiv:2602.17798) | 2026 | - | Routes using **Matrix Bingham distributions** on the Grassmannian manifold to control routing entropy. **HIGHEST OVERLAP WITH PROPOSED IDEA.** Uses statistical distributions (Bingham) for routing with concentration parameters as control knobs. However: (a) not CL-specific, (b) distributions characterize routing preferences, not task knowledge, (c) no drift/anti-invasion mechanisms. |
-| 24 | **Spectral Manifold Regularization for Stable Routing in Deep MoE** (arXiv:2601.03889) | 2026 | - | Manifold-based regularization for stable/modular routing. May overlap with geometric characterization concepts. |
----
-## II. NOVELTY GAPS: What Has NOT Been Done
-### GAP 1: Statistical Knowledge Signatures as Expert "Fingerprints" (HIGH NOVELTY)
-**No existing work** creates rich statistical distribution-based "signatures" (vMF, Bingham, GMM, etc.) that characterize what each expert **knows** — i.e., the knowledge space/competence region of each submodule. Existing works either:
-- Use vMF as a **training loss** (Michel et al., AAAI 2024) — not as a module descriptor
-- Use Bingham distributions for **routing control** (GrMoE, 2026) — not for knowledge characterization
-- Use simple prototypes/centroids for task matching (TRGE, 2025) — not rich distributional signatures
-**Your contribution**: Using multi-modal statistical distributions (vMF, Bingham, GMM combinations) as a formal **fingerprint** of each module's learned knowledge region. This creates a principled, interpretable language for what each expert "knows."
-### GAP 2: OT as Distribution-Matching Routing (not just Load-Balancing) (HIGH NOVELTY)
-All existing OT-based routing (BASE Layers, Sinkhorn Routing, SSR) uses OT to solve a **load-balancing** problem: distribute tokens evenly across experts. The cost matrix is typically derived from learned linear projections.
-**No existing work** uses OT with a cost matrix derived from **distributional distances** between input statistics and expert knowledge signatures. This is a qualitatively different OT formulation:
-- Existing: $\min_{\pi} \sum_{ij} c_{ij}\pi_{ij}$ where $c_{ij} = -\text{score}(x_i, e_j)$ (learned similarity)
-- Proposed: $\min_{\pi} \sum_{ij} d(P_{\text{input}_i}, Q_{\text{expert}_j})\pi_{ij}$ where $d$ is a distributional distance (e.g., KL between vMF distributions)
-### GAP 3: Three-Component Integration (VERY HIGH NOVELTY)
-**No paper** combines all three:
-1. Statistical distribution signatures for module knowledge
-2. OT-based distribution-matching routing
-3. Backbone anti-drift + anti-invasion penalty
-The closest works address at most 2 of 3 and in different ways:
-- SCDEM: OT for alignment + expert expansion (but no signature-based routing, no anti-invasion)
-- GrMoE: Statistical routing (but not CL, no drift penalty)
-- PASs-MoE: Router drift mitigation + expert isolation (but uses subspace methods, not OT or statistical signatures)
-- LDC/FDC: Drift compensation (but single backbone, no expert routing)
-### GAP 4: Anti-Invasion Loss in MoE-based CL (MODERATE-HIGH NOVELTY)
-While drift compensation exists widely, the concept of an **anti-invasion loss** — explicitly preventing new task feature distributions from encroaching on old task knowledge regions in the shared backbone — is relatively unique when combined with MoE routing. Most drift compensation works operate on a single model; applying it specifically to the **shared backbone** in a modular architecture while letting the experts handle task-specific adaptation is novel.
----
-## III. RISK AREAS: Where Novelty Might Be Challenged
-### RISK 1: GrMoE (Grassmannian MoE) — **MEDIUM-HIGH RISK**
-**Paper**: arXiv:2602.17798 (Feb 2026)
-**Why risky**: Uses Matrix Bingham distributions on Grassmannian manifolds for routing — this is statistical-distribution-based routing, the closest conceptual cousin to your idea.
-**Mitigation**: (a) GrMoE is NOT for continual learning, (b) Bingham controls routing entropy, not knowledge characterization, (c) no drift/anti-invasion mechanisms. Your work must clearly differentiate the "signature" interpretation from the "routing control" interpretation.
-### RISK 2: Selective Sinkhorn Routing (SSR) — **MEDIUM RISK**
-**Paper**: arXiv:2511.08972 (Nov 2025)
-**Why risky**: Already formulates token-to-expert as OT using Sinkhorn.
-**Mitigation**: SSR uses OT for load-balancing only — your OT formulation uses distributional distances as cost, making it fundamentally different in semantics.
-### RISK 3: SCDEM — **MEDIUM RISK**
-**Paper**: arXiv:2504.10561 (Apr 2025)
-**Why risky**: Uses OT distance + dynamic expert expansion in CL. Has Feature Distribution Consistency (FDC) via OT.
-**Mitigation**: SCDEM uses OT for alignment between old/new features (preservation), NOT for routing decisions. The routing in SCDEM is separate from the OT component.
-### RISK 4: PASs-MoE + CaRE — **LOW-MEDIUM RISK**
-**Papers**: arXiv:2601.13020, arXiv:2602.03473 (Jan-Feb 2026)
-**Why risky**: Active area of research on CL + MoE routing with drift considerations.
-**Mitigation**: These use learned subspace methods (PAS) and bi-level routing (task-router + expert-router), not distribution-matching OT.
-### RISK 5: vMF for Online CL — **LOW RISK**
-**Paper**: arXiv:2306.03364 (AAAI 2024)
-**Why risky**: Same statistical tool (vMF) same domain (CL).
-**Mitigation**: Uses vMF as training loss, not as module knowledge signature. No MoE, no routing.
----
-## IV. OVERALL NOVELTY ASSESSMENT
-### Rating: **HIGH (with specific caveats)**
-### Justification:
-**Strengths of novelty:**
-1. **No existing paper** combines statistical knowledge signatures + OT-based distribution-matching routing + backbone anti-drift in a unified CL framework. The **three-way integration** is clearly novel.
-2. **The "knowledge signature" concept** — using rich statistical distributions (vMF, Bingham, GMM) to create interpretable fingerprints of what each expert module has learned — is a genuinely new formulation. Existing works use distributions either for training losses or for routing entropy control, but not as descriptive signatures of module competence.
-3. **OT for distribution-matching routing** (as opposed to load-balancing) is a new semantic interpretation of OT in the MoE context. Using distributional distances in the cost matrix of the transport problem is novel.
-4. **Anti-invasion loss for shared backbone** in a modular CL architecture (protecting old task regions while allowing new learning) is novel as a combination — though drift compensation alone is well-studied.
-**Caveats:**
-1. **GrMoE (Feb 2026)** is the closest risk — a reviewer familiar with GrMoE might see conceptual similarity in "statistical distributions for routing." You MUST clearly explain why knowledge signatures ≠ routing entropy control.
-2. **SSR (Nov 2025)** + **BASE Layers** have established OT for MoE routing — you need to clearly differentiate cost matrix semantics.
-3. The field of **MoE for CL** is extremely active (12+ papers in 2025-2026 alone). Given the fast pace, there's a ~15-20% risk that a similar combined idea could appear before submission.
-**Recommended positioning:**
-Frame as: *"First unified framework that creates interpretable statistical knowledge signatures for expert modules and uses Optimal Transport not for load balancing but for semantically-grounded distribution-matching routing in continual learning, complemented by backbone anti-drift protection."*
----
-**Summary Table:**
-| Component | Individual Novelty | Closest Overlap | Risk Level |
-|-----------|-------------------|-----------------|------------|
-| Statistical Knowledge Signatures | **High** | vMF for Online CL (AAAI'24), GrMoE (Feb'26) | Medium |
-| OT as Distribution-Matching Routing | **High** | SSR (Nov'25), BASE Layers (ICML'21) | Medium |
-| Backbone Anti-Drift + Anti-Invasion | **Medium** | LDC (ECCV'24), Cascaded FDC (2022) | Low-Medium |
-| **Three-Component Integration** | **Very High** | SCDEM (Apr'25), PASs-MoE (Jan'26) | Low |

human_working_IdeaMethod_and_discuss/proposal_gainlora_upgrade.md DELETED Viewed

@@ -1,305 +0,0 @@
-# Proposal: OT-SIGN — Statistical Signatures + Optimal Transport Routing for GainLoRA
----
-## PART 0: SURVEY VERIFICATION
-**Result: ✅ All survey information is accurate. No corrections needed.**
-| Paper | arXiv ID | Verified | Description in survey |
-|-------|----------|---------|-------------------|
-| Grassmannian MoE | 2602.17798 | ✅ Exists | "Bingham distribution on the Grassmannian to control routing entropy" → CORRECT. Not CL. |
-| Selective Sinkhorn Routing (SSR) | 2511.08972 | ✅ Exists | "OT for token-to-expert load balancing" → CORRECT. Not distribution matching. |
-| Continual Pre-training of MoEs | 2503.05029 | ✅ Exists | "Sinkhorn-balanced routing in a CPT context" → CORRECT. Studies router robustness, not CL with signatures. |
-| SCDEM | 2504.10561 | ✅ Exists | "OT for feature alignment (FDC), not routing" → CORRECT. Full name: Self-Controlled Dynamic Expansion Model. |
-**Conclusion**: The four novelty gaps in `novelty_search_report.md` still hold. No paper combines statistical signatures + OT distribution-matching routing + backbone anti-drift in CL.
----
-## PART 1: PROBLEMS WITH THE CURRENT GAINLORA
-### 1.1 Current Gating Architecture (from `t5_gainlora_inflora.py`)
-GainLoRA uses a **key-query cosine attention** routing mechanism:
-```
-Step 1: avg_inputs_embeds = weighted_mean(token_embeddings)  # shape (B, 1, d)
-Step 2: x = trans_input(avg_inputs_embeds)                   # 2-layer MLP → (B, 1, d)
-        x = normalize(x)                                     # unit sphere
-Step 3: score_t = cosine_sim(x, prompt_key_t)                # scalar per task
-        weight_t = |sigmoid(4 * score_t) * 2 - 1|
-Step 4: agg_lora = Σ_t  weight_t * lora_t(hidden_states)    # weighted sum
-```
-Where:
-- `prompt_key_t ∈ R^d`: learnable vector for task t
-- `trans_input`: 2-layer MLP (d → mlp_hidden → d, SiLU activation)
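The gating function in Step 3 can be examined in isolation. The following is a minimal sketch that assumes only the formula shown above; `gate_weight` is a hypothetical helper name, not a function from the GainLoRA codebase. Because `2 * sigmoid(z) - 1 = tanh(z / 2)`, the weight equals `|tanh(2 * score)|`, so it depends only on the magnitude of the cosine similarity.

```python
import math

def gate_weight(score: float) -> float:
    """|sigmoid(4 * score) * 2 - 1| for a cosine similarity in [-1, 1]."""
    sig = 1.0 / (1.0 + math.exp(-4.0 * score))
    return abs(2.0 * sig - 1.0)

# Print the weight for a few cosine similarities to show the symmetric shape.
for s in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(f"score={s:+.1f}  weight={gate_weight(s):.4f}")
```

The symmetry means a strongly negative similarity receives the same weight as an equally strong positive one, which is worth keeping in mind when reading the routing critique that follows.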
-### 1.2 Four Core Problems
-**Problem 1: Routing has no distributional foundation (non-distributional routing)**
-`prompt_key_t` is a **point in space** (a point estimate), not a **distribution** over task t's knowledge space. This means:
-- Routing only measures distance to a single representative point
-- It cannot capture the spread or shape of the knowledge space (some tasks have widely spread features, others are concentrated)
-- Inputs on the boundary between two tasks are not allocated in a principled way
-**Problem 2: Gating weights do not guarantee global optimality**
-`weight_t = |sigmoid(4 * cos_sim) * 2 - 1|` is a **local** function of each (input, task) pair, monotone in `|cos_sim|`. No global constraint ensures that the assignment is optimal over the whole batch or the whole expert set. This leads to:
-- Unbalanced expert utilization (some LoRA experts are underused)
-- No theoretical guarantee on assignment quality
-**Problem 3: Backbone drift is not explicitly controlled**
-During sequential training, `trans_input` (the MLP that processes the input) is updated for the current task with no protection mechanism. After learning $K$ tasks:
-- `trans_input` may drift far from the input features of old tasks
-- The `prompt_key` of each old task was learned jointly with the old `trans_input` → it becomes misaligned with the new `trans_input`
-- Result: routing for old tasks becomes less accurate even though the LoRA weights are preserved
-**Problem 4: Experts are not on equal footing (non-parallel feature spaces)**
-This is a deeper architectural problem, hidden in how GainLoRA builds `past_x` (line 1305 of `t5_gainlora_inflora.py`):
-```python
-past_x = torch.cat([x, self.previous_trans_input(avg_inputs_embeds)], dim=1)
-#                   ↑current task           ↑ N frozen snapshots (task_0, task_1, ...)
-key_attention_weights = self.cal_attention(past_prompt_key, past_x)
-```
-`previous_trans_input` is a module containing $t-1$ separate MLPs, each a **snapshot frozen at the time that task was trained**. The result:
-| Expert | Feature extractor | Feature space |
-|--------|-----------------|--------------|
-| Task 0 | `trans_input_frozen_at_t=0` | $\mathcal{F}_0$ |
-| Task 1 | `trans_input_frozen_at_t=1` | $\mathcal{F}_1$ |
-| Task $t$ (current) | `trans_input` (being updated) | $\mathcal{F}_t$ |
-Routing computes **cosine similarity** between vectors drawn from different spaces $\mathcal{F}_0, \mathcal{F}_1, \ldots, \mathcal{F}_t$, so the comparison has no consistent geometric meaning. `prompt_key_i` is learned in $\mathcal{F}_i$ but used for routing in $\mathcal{F}_t$ → experts are scored unfairly, not because of knowledge match but because of feature-space mismatch. In addition, memory overhead grows linearly: 15 tasks → 15 copies of the MLP.
----
-## PART 2: PROPOSED IMPROVEMENTS (GainLoRA → OT-SIGN)
-### 2.1 Overview
-Each of the four weaknesses above is replaced by a corresponding component:
-| Problem | Current GainLoRA | Proposed OT-SIGN |
-|--------|------------------|-----------------|
-| Point routing | `prompt_key_t ∈ R^d` | vMF signature `(μ_t, κ_t)` |
-| Local scoring | cosine sim → sigmoid | OT cost = negative vMF log-likelihood → Sinkhorn |
-| No backbone protection | None | Anti-drift + Anti-invasion loss |
-| Non-parallel experts | $N$ frozen `previous_trans_input` snapshots | One shared `trans_input` + signatures in the same space |
-### 2.2 Component 1 — vMF Knowledge Signatures
-**Replace `prompt_key_t ∈ R^d` with a von Mises-Fisher signature `(μ_t, κ_t)`**
-After training on task $t$ finishes, make one pass over the training data to collect:
-$$\mu_t = \frac{\bar{x}_t}{\|\bar{x}_t\|}, \qquad \kappa_t \approx \frac{\bar{r}\,d - \bar{r}^3}{1 - \bar{r}^2}$$
-where $\bar{x}_t = \mathbb{E}[\text{normalize}(\text{trans\_input}(x))]$ (the mean of the unit-normalized features after the MLP) and $\bar{r} = \|\bar{x}_t\|$ (the mean resultant length). This is the standard closed-form approximation to the vMF maximum-likelihood estimate (Banerjee et al., 2005), with $d$ the ambient dimension.
-**Why vMF?**
-- Features after `normalize(trans_input(x))` lie on the unit hypersphere $\mathcal{S}^{d-1}$ → exactly the domain of vMF
-- vMF captures both **direction** (μ: the center of the knowledge) and **concentration** (κ: tasks with diverse inputs have small κ, concentrated tasks have large κ)
-- It stores only $d + 1$ scalars versus the current $d$ scalars (minimal overhead)
-**Code integration**: add to the end-of-task hook in `cl_trainer_gainlora_inflora.py`:
-```python
-import torch
-import torch.nn.functional as F
-def compute_vmf_signature(self, dataloader, model, task_id):
-    """Run after training each task to fit its vMF signature."""
-    model.eval()
-    all_x = []
-    with torch.no_grad():
-        for batch in dataloader:
-            avg_emb = (batch['attention_mask'].unsqueeze(-1) *
-                       model.encoder.embed_tokens(batch['input_ids'])).mean(dim=1, keepdim=True)
-            medium = model.encoder.trans_input[1](model.encoder.trans_input[0](avg_emb))
-            x = model.encoder.trans_input[3](model.encoder.trans_input[2](medium))
-            x = F.normalize(x.squeeze(1), dim=-1)  # (B, d)
-            all_x.append(x)
-    all_x = torch.cat(all_x, dim=0)
-    x_bar = all_x.mean(0)                                    # (d,)
-    r_bar = x_bar.norm()                                     # mean resultant length (scalar)
-    mu_t = F.normalize(x_bar, dim=-1)                        # mean direction
-    d = all_x.shape[-1]                                      # ambient feature dimension
-    # Banerjee et al. (2005) approximation: kappa ≈ (r_bar * d - r_bar^3) / (1 - r_bar^2)
-    kappa_t = r_bar * (d - r_bar**2) / (1 - r_bar**2)
-    model.encoder.vmf_signatures[task_id] = (mu_t.detach(), kappa_t.detach())
-```
-### 2.3 Component 2 — OT Distribution-Matching Routing
-**Replace `cal_attention` (cosine sim) with Sinkhorn-OT using cost = negative vMF log-likelihood**
-Given the input feature $x_b$ (after `trans_input`, normalized) and $N$ task signatures, compute the cost matrix:
-$$C_{bt} = -\kappa_t \cdot (\mu_t \cdot x_b), \qquad C \in \mathbb{R}^{B \times N}$$
-(the negative vMF log-likelihood with the log-normalizer dropped; note that the normalizer depends on $\kappa_t$, so dropping it is a simplification rather than the removal of a true constant)
-Then run Sinkhorn OT (entropic regularization, $\varepsilon = 0.05$, 10 iterations):
-$$\Pi^* = \text{Sinkhorn}(C, \varepsilon), \quad \Pi^* \in \mathbb{R}^{B \times N}, \quad \Pi^* \mathbf{1} = \mathbf{1}/B$$
-`key_attention_weights` = $\Pi^*$ with each row rescaled to sum to 1 and reshaped to $(B, N, 1)$ → passed to `agg_lora_states` exactly as it is today.
-**Code integration**: replace the `cal_attention` function in `T5Stack`:
-```python
-import math
-import torch
-def cal_attention_ot(self, x, task_id=None):
-    """
-    x: (B, 1, d), normalized input features
-    Returns OT transport weights: (B, N_tasks, 1)
-    """
-    x = x.squeeze(1)  # (B, d)
-    # Build the cost matrix from the vMF log-likelihood: C[b,t] = -kappa_t * (mu_t · x_b)
-    sigs = list(self.vmf_signatures.values())
-    mu_stack = torch.stack([mu for mu, _ in sigs], dim=0).to(x.device, dtype=x.dtype)             # (N, d)
-    kappa_stack = torch.stack([torch.as_tensor(k) for _, k in sigs]).to(x.device, dtype=x.dtype)  # (N,)
-    dot_products = x @ mu_stack.T      # (B, N)
-    C = -kappa_stack.unsqueeze(0) * dot_products   # (B, N) cost matrix
-    # Sinkhorn iterations (log-domain for stability)
-    weights = sinkhorn_log(C, epsilon=0.05, n_iter=10)  # (B, N)
-    return weights.unsqueeze(2)  # (B, N, 1), same shape as current key_attention_weights
-def sinkhorn_log(C, epsilon=0.05, n_iter=10):
-    """Log-domain Sinkhorn with uniform marginals; returns row-normalized weights (B, N)."""
-    B, N = C.shape
-    log_a = torch.full((B,), -math.log(B), device=C.device, dtype=C.dtype)  # uniform source, log(1/B)
-    log_b = torch.full((N,), -math.log(N), device=C.device, dtype=C.dtype)  # uniform target, log(1/N)
-    log_K = -C / epsilon
-    u = torch.zeros_like(log_a)
-    v = torch.zeros_like(log_b)
-    for _ in range(n_iter):
-        # Alternate the row (u) and column (v) scaling updates
-        u = log_a - torch.logsumexp(log_K + v.unsqueeze(0), dim=1)
-        v = log_b - torch.logsumexp(log_K + u.unsqueeze(1), dim=0)
-    pi = (log_K + u.unsqueeze(1) + v.unsqueeze(0)).exp()  # (B, N), rows sum to roughly 1/B
-    return pi / pi.sum(dim=1, keepdim=True)  # rescale each row to sum to 1
-```
-**Why is OT better than cosine sim?**
-- The cost matrix encodes a "distributional distance": inputs close to a task's knowledge region are routed more strongly to that task
-- The Sinkhorn marginal constraints couple the assignment across the whole batch, yielding the (entropy-regularized) **globally optimal assignment** instead of independent per-pair scores
-- Each row of the rescaled OT weights sums to 1 → no ad-hoc normalization such as `|sigmoid(...)*2-1|` is needed
-- Differentiable → gradients still flow through the weights to the `trans_input` MLP
-### 2.4 Component 3 — Backbone Anti-Drift Loss
-**Add two penalty terms to the training loop of each new task**
-**Anti-drift loss**: protects `trans_input` from drifting on replay data:
-$$\mathcal{L}_{\text{drift}} = \frac{1}{|\mathcal{B}_{\text{replay}}|} \sum_{x \in \mathcal{B}_{\text{replay}}} \left\| \text{trans\_input}(x) - \text{trans\_input}_{\text{ref}}(x) \right\|^2$$
-where `trans_input_ref` is a frozen snapshot of `trans_input` taken after task $t-1$.
-**Anti-invasion loss**: prevents the new task's features from "invading" old tasks' regions in feature space:
-$$\mathcal{L}_{\text{inv}} = \sum_{s < t} \max\left(0,\ \kappa_s \cdot (\mu_s \cdot x_{\text{new}}) - \tau \right)$$
-where $x_{\text{new}}$ are the current task's features, $(\mu_s, \kappa_s)$ is the signature of old task $s$, and $\tau$ is a threshold (e.g., $\tau = -\log(0.1)$). This term penalizes new-task features that have high likelihood under an old task's signature.
-**Total loss function:**
-$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \lambda_{\text{KL}} \mathcal{L}_{\text{KL}} + \lambda_{\text{drift}} \mathcal{L}_{\text{drift}} + \lambda_{\text{inv}} \mathcal{L}_{\text{inv}}$$
-($\mathcal{L}_{\text{KL}}$ is the replay loss already present in GainLoRA)
-**Code integration**: in `compute_loss` of `cl_trainer_gainlora_inflora.py`:
-```python
-# Anti-drift (added after the replay KL loss)
-if self.args.anti_drift and self.ref_trans_input is not None:
-    replay_avg = (replay_mask.unsqueeze(-1) * self.model.encoder.embed_tokens(replay_ids)).mean(1)
-    x_curr = self.model.encoder.trans_input(replay_avg)  # F.normalize inside
-    with torch.no_grad():
-        x_ref = self.ref_trans_input(replay_avg)
-    drift_loss = self.args.lambda_drift * F.mse_loss(x_curr, x_ref)
-    loss = loss + drift_loss
-# Anti-invasion (computed on the current task batch)
-if self.args.anti_invasion and hasattr(self.model.encoder, 'vmf_signatures'):
-    x_new = F.normalize(self.model.encoder.trans_input(avg_emb_curr), dim=-1)
-    invasion_loss = 0.0
-    for t_id, (mu_s, kappa_s) in self.model.encoder.vmf_signatures.items():
-        if t_id < self.current_task_id:
-            log_lik = kappa_s * (mu_s @ x_new.T).mean()
-            invasion_loss += F.relu(log_lik - self.args.invasion_threshold)
-    loss = loss + self.args.lambda_inv * invasion_loss
-```
----
-## PART 3: FEASIBILITY ASSESSMENT
-### 3.1 Why GainLoRA Is the Best Candidate
-Based on the code analyzed (`t5_gainlora_inflora.py`, `cal_attention`, `agg_lora_states`):
-| Factor | Assessment | Details |
-|--------|---------|---------|
-| Feature space already normalized | ✅ Ideal | `x = x/x.norm()` at line 1210 → lies directly on $\mathcal{S}^{d-1}$ → vMF domain |
-| Gating uses scalar weights | ✅ Easy to replace | `key_attention_weights (B, N, 1)` feeds into `agg_lora_states` → only the same output shape is needed |
-| Multi-task keys structure | ✅ Already present | `previous_prompts_keys` (N, d) → replace with the `vmf_signatures` dict |
-| Sequential training loop | ✅ Clear | An end-of-task hook can be added to `cl_trainer` after `save_model()` |
-| Small lora_r=4 | ✅ No impact | Signatures are fit on the `trans_input` output (d=1024), not in the r=4 space |
-| Memory overhead | ✅ Significantly reduced | Removes `previous_trans_input` (~15 × MLP size) and replaces it with 15 × (d+1) floats for signatures |
-| Non-parallel expert problem | ✅ Fully resolved | With `previous_trans_input` removed, all experts use the same `trans_input` → same feature space $\mathcal{S}^{d-1}$ |
-| Sinkhorn on T4 | ✅ Feasible | 15 tasks, B=8, 10 iterations → <1ms/forward pass |
-| Differentiable | ✅ | Log-domain Sinkhorn has gradients → no need to change the optimizer |
-### 3.2 Minimal Changes Required
-Only **3 places** in the GainLoRA codebase need to be modified:
-1. **`t5_gainlora_inflora.py → T5Stack.__init__`**: Replace `self.prompt_key` with `self.vmf_signatures = {}` + add `cal_attention_ot()` + `sinkhorn_log()`
-2. **`t5_gainlora_inflora.py → T5Stack.forward`**: Replace `self.cal_attention(...)` with `self.cal_attention_ot(x)` once signatures have been loaded
-3. **`cl_trainer_gainlora_inflora.py`**: Add a `compute_vmf_signature()` call at the end of each task + add the drift/invasion losses in `compute_loss()`
-Left completely unchanged:
-- `LoRALayer`, `agg_lora_states`, InfLoRA SVD projection
-- KL distillation loss (replay)
-- `trans_input` MLP architecture
-- `previous_lora_weights_*` mechanism
-- DeepSpeed / training infrastructure
-### 3.3 Implementation Risks
-| Risk | Level | Mitigation |
-|--------|--------|----------|
-| κ estimation unstable (κ → 0 or ∞) | Medium | Clip κ ∈ [0.1, 50]; fall back to cosine routing when κ < 0.5 |
-| Sinkhorn does not converge when ε is too small | Low | Use ε = 0.05–0.1; the log-domain form is stable |
-| Anti-drift too strong → severe underfitting | Medium | Use a decreasing schedule for λ_drift, starting from 0.01 |
-| vMF fit on lora_r=4 features (if fit at the wrong level) | Low | **Fit on the trans_input output (d=1024), not on the LoRA factors** |
-| T5-Large + 15 tasks + signatures + Sinkhorn OOM | Low | Signatures are only 15×1025 floats ≈ 60KB; Sinkhorn is matrix ops and does not grow the model size |
----
-## PART 4: SUMMARY OF CONTRIBUTIONS
-### Differences from Related Papers
-| Closest paper | Key difference |
-|-----------|----------------|
-| GrMoE (2602.17798) | GrMoE: Bingham controls **routing entropy** (sparsity). OT-SIGN: vMF describes each expert's **knowledge region**. GrMoE is not CL and has no anti-invasion. |
-| SSR (2511.08972) | SSR: OT for **load balancing** (cost = learned linear score). OT-SIGN: OT for **distribution matching** (cost = vMF log-likelihood). Entirely different semantics. |
-| SCDEM (2504.10561) | SCDEM: OT for **feature alignment** between epochs (FDC). OT-SIGN: OT as the **routing mechanism** for selecting experts. |
-| PASs-MoE (2601.13020) | PASs-MoE: subspace methods for router alignment. OT-SIGN: statistical signatures + global OT assignment. |
-### Contribution Claim
-> *OT-SIGN is the first framework to use von Mises-Fisher distributions as fingerprints of each expert module's knowledge region in modular continual learning, while replacing heuristic gating with Optimal Transport under a semantic cost matrix (vMF log-likelihood), combined with anti-drift and anti-invasion losses to protect the shared representation space.*
----
-*Analysis date: based on GainLoRA codebase + survey verification against arXiv 2024-2026*

human_working_IdeaMethod_and_discuss/research_rule.txt DELETED Viewed

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:fb1778bc49b36013798b859362b5419e346aba817bcdca2d6c798860a2dd6a46
-size 963

human_working_IdeaMethod_and_discuss/revised_idea_analysis.md DELETED Viewed

@@ -1,485 +0,0 @@
-# ANALYSIS OF THE IDEA PIVOT: From Data-Level Signatures → Module-Level Signatures
-## Comprehensive Analysis Report
-**Date**: March 7, 2026
-**Context**: The old idea (OT-SIGN) violates the zero-replay setting → a change of direction is needed
----
-# PART 1: CONFIRMING THE VIOLATIONS AND CHANGING DIRECTION
-## 1.1 Two Violations in the Old Idea
-### Violation 1: vMF over old data = statistical replay
-**Detailed analysis**: The zero-replay setting (GainLoRA Section 2.2, InfLoRA Section 2.2) requires:
-> "The model must learn without access to real or synthetic samples from previously learned tasks."
-Fitting the vMF signature $(\mu_t, \kappa_t)$ at the end of each task requires **running a forward pass over the task's training data** to collect features → the stored result is precisely a *statistical summary* of the old data distribution → **it violates zero-replay**.
-**Evidence from papers**: None of the LoRA-based CL papers in the survey (GainLoRA, InfLoRA, O-LoRA, C-LoRA, MINGLE) stores any statistical information about old data. The only things allowed to be stored are:
-- **Parameter weights** (frozen LoRA A, B matrices)
-- **Subspace bases** (GPM/DualGPM bases $M_t$: these are basis vectors, NOT data statistics)
-- **Gating module weights** (trans_input MLP weights)
-A subtle boundary: GPM bases $M_t$ are computed from the input covariance $H_t^T H_t$. At first glance this looks like "statistics", but it is accepted because $M_t$ only captures **directions (a subspace)** and does NOT capture **distribution parameters** (mean, concentration, shape). It amounts to a **memory of which directions were used**, not a **memory of what the data looked like**.
-### Violation 2: The anti-invasion loss is unnecessary
-**Detailed analysis**: In the expandable-LoRA architecture:
-- **InfLoRA**: $B_t$ is designed to lie in $\mathcal{N}_t \cap \mathcal{M}_t^{\perp}$, the intersection of the new task's gradient space and the null space of old tasks. This **mathematically guarantees** that the update for the new task lies in a subspace orthogonal to the old tasks.
-- **O-LoRA**: The soft penalty $\lambda_1 \sum_{t'<t} \|A_{t'} A_t^T\|_1$ encourages the A matrices to be mutually orthogonal.
-- **GainLoRA**: Constraints (5)+(7) on the gating module ensure $g_t(x) = 0$ for old-task inputs.
-→ With these mechanisms, the LoRA branches **are already designed** not to invade one another → adding an anti-invasion loss is **redundant** and violates Occam's razor.
-## 1.2 New Direction: Exploiting Information from the LoRA Modules
-**New idea**: Instead of summarizing the old data distribution, exploit the **intrinsic** (statistical, geometric) information of the LoRA submodules, that is, analyze the matrices $A_t, B_t$ themselves, and use it as the routing signature.
-**Why is this valid?** Because $A_t, B_t$ are **model parameters**, not data. They are frozen after training and are a natural part of the model → analyzing them does NOT violate zero-replay.
----
-# PART 2: SURVEY OF SETTINGS AND RELATED PAPERS
-## 2.1 Papers in the Same Setting (Zero-Replay, LoRA-Expansion, Task-ID-Free)
-| Paper | Venue | LoRA Constraint | Routing | What is stored from old tasks? |
-|-------|-------|----------------|---------|---------------------|
-| **InfLoRA** [Liang & Li, CVPR'24] | CVPR 2024 | Hard: $B_t$ in $\mathcal{N}_t \cap \mathcal{M}_t^{\perp}$ | No routing (merge all) | GPM bases $M_t$ |
-| **O-LoRA** [Liang & Li] | Same group as InfLoRA | Random init, CE loss only | Merge all ($a_i = 1$) | Nothing extra |
-| **C-LoRA** [Smith et al., 2023] | CoRR | Soft: null-space regularization | Merge all | Null-space directions |
-| **GainLoRA** [Liang et al., NeurIPS'25] | NeurIPS 2025 | Inherits InfLoRA/O-LoRA | **Gating: cosine sim → sigmoid** | GPM bases + frozen trans_input snapshots |
-| **MINGLE** [Qiu et al., NeurIPS'25] | NeurIPS 2025 | Entropy-based null-space SVD | **MoE gating: FC → softmax** | Input covariance SVD subspace $U$ |
-| **CLoRA** [ACL'25] | ACL 2025 | Null space regularization on the output matrix | Merge all | Null-space directions |
-| **TreeLoRA** [ICML'25] | ICML 2025 | No explicit orthogonality | **Gradient-similarity tree routing** | Gradient similarity scores |
-| **PLAN** [ICCV'25] | ICCV 2025 | Orthogonal basis allocation per task | Perturbation-based selection | Orthonormal basis set |
-| **Feature Distributions** [ICML'25] | ICML 2025 | No explicit orthogonality | **Mean feature vector matching** | Mean feature vectors per PEFT block |
-| **SD-LoRA** [ICLR'25] | ICLR 2025 | Decoupled magnitude/direction | Low-loss trajectory | Direction/magnitude decomposition |
-### Key observations:
-1. **None** of the papers in this setting stores data statistics (vMF, covariance, GMM) from old tasks
-2. Current routing mechanisms: cosine similarity (GainLoRA), FC gating (MINGLE), gradient similarity (TreeLoRA), mean features (Feature Distributions). **No paper yet uses LoRA weight properties as routing signatures**
-3. Closest paper in concept: **Feature Distributions** (ICML'25) uses a mean feature vector, but this is feature-level, NOT weight-level
-## 2.2 What Information May Be Exploited?
-Under the zero-replay setting, only the following may be exploited:
-| Source | Example | Valid? |
-|-------|-------|---------|
-| Frozen model weights | $A_t, B_t$ matrices, gating weights | ✅ Fully |
-| Subspace bases from GPM | $M_t, M_t^{\perp}$ | ✅ (already used by InfLoRA) |
-| Pre-trained model weights | Base $W$ | ✅ |
-| Current task data | $\mathcal{D}_t$ (only the task being trained) | ✅ |
-| Old task data/statistics | vMF, mean, covariance | ❌ Violation |
----
-# PART 3: LoRA MODULES, THEIR PROPERTIES AND EXPLOITABLE FEATURES
-## 3.1 What Is a LoRA Module?
-Each LoRA branch for task $t$ consists of:
-- $B_t \in \mathbb{R}^{r \times d_{in}}$ : **Dimensionality reduction matrix** (encodes the input subspace)
-- $A_t \in \mathbb{R}^{d_{out} \times r}$ : **Dimensionality increasing matrix** (fine-tuned; encodes the task-specific transformation)
-For GainLoRA: $r = 4$, $d_{in} = d_{out} = 1024$ (T5-Large).
-**Geometric meaning:**
-- Each **row** of $B_t$ ($b_i^t \in \mathbb{R}^{d_{in}}$) is a **direction vector** in input space
-- $\text{span}\{b_1^t, \ldots, b_r^t\}$ defines the **subspace in which task $t$ operates**
-- $A_t B_t$ = rank-$r$ perturbation of the weight matrix $W$ → the task-specific **adaptation direction**
-**Key fact (Proposition 1 from InfLoRA)**:
-> Fine-tuning $A_t$ is equivalent to fine-tuning the pre-trained weight $W$ within the subspace $\text{span}\{b_1^t, \ldots, b_r^t\}$.
-→ **$B_t$ fully characterizes the "operating subspace" of task $t$**
-## 3.2 Geometric Features of LoRA Modules
-### a) Singular Value Decomposition (SVD) of $A_t B_t$
-$$A_t B_t = U_t \Sigma_t V_t^T$$
-where:
-- $U_t \in \mathbb{R}^{d_{out} \times r}$: **Output directions**, the directions that task $t$ "emits" into the output space
-- $\Sigma_t = \text{diag}(\sigma_1^t, \ldots, \sigma_r^t)$: **Singular values**, the "strength/importance" of each direction
-- $V_t \in \mathbb{R}^{d_{in} \times r}$: **Input directions**, the subspace that task $t$ "listens to" in the input space
-**Properties:**
-1. **Singular values $\sigma_i^t$** reflect relative importance of each direction for task $t$
-2. **Right singular vectors $v_i^t$** define the input receptive subspace
-3. **Left singular vectors $u_i^t$** define the output emission subspace
-4. **Spectral entropy** $H_t = -\sum_i \hat{\sigma}_i \log \hat{\sigma}_i$ (with $\hat{\sigma}_i = \sigma_i / \sum_j \sigma_j$) measures the "spread" of task knowledge across directions
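The spectral entropy in property 4 can be computed from the frozen factors alone. The following is a minimal NumPy sketch under stated assumptions: `spectral_entropy` is a hypothetical helper name, and the matrix sizes and random factors are illustrative stand-ins, not values taken from GainLoRA.

```python
import numpy as np

def spectral_entropy(A: np.ndarray, B: np.ndarray) -> float:
    """Entropy of the normalized singular values of delta_W = A @ B."""
    r = A.shape[1]
    # delta_W has rank at most r, so only the top r singular values matter
    sigma = np.linalg.svd(A @ B, compute_uv=False)[:r]
    p = sigma / sigma.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 48, 4  # illustrative sizes only
A = rng.standard_normal((d_out, r))
B = rng.standard_normal((r, d_in))
print(spectral_entropy(A, B), "upper bound:", np.log(r))
```

The entropy ranges from 0 (one dominant direction) to `log(r)` (all `r` directions equally strong), so it gives a single scalar summary of how concentrated a task's adaptation is.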
-### b) Grassmann Manifold Perspective
-The collection of $r$-dimensional subspaces of $\mathbb{R}^{d}$ forms the **Grassmann manifold** $\text{Gr}(r, d)$.
-Each LoRA branch for task $t$ → a point $\mathcal{V}_t = \text{span}(V_t)$ on $\text{Gr}(r, d_{in})$ (input side) or $\mathcal{U}_t = \text{span}(U_t)$ on $\text{Gr}(r, d_{out})$ (output side).
-**Grassmannian distance** between two tasks:
-$$d_G(\mathcal{V}_i, \mathcal{V}_j) = \|\theta\|_2 = \sqrt{\sum_{k=1}^r \theta_k^2}$$
-where $\theta_k = \arccos(\sigma_k)$ are the **principal angles** between the two subspaces and $\sigma_k$ are the singular values of $V_i^T V_j$.
-**Meaning**: Tasks whose subspaces are close (small Grassmann distance) likely share knowledge → routing should fuse them. Tasks whose subspaces are far apart hold independent knowledge → routing should select them separately.
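The principal-angle distance above needs only orthonormal bases of the two input subspaces. The following is a minimal NumPy sketch, assuming the bases are obtained by a reduced QR factorization of $B_t^T$; `orthonormal_basis` and `grassmann_distance` are hypothetical helper names, and the random matrices stand in for real LoRA factors.

```python
import numpy as np

def orthonormal_basis(M: np.ndarray) -> np.ndarray:
    """Orthonormal basis (d, r) for the column span of M, via reduced QR."""
    Q, _ = np.linalg.qr(M)
    return Q

def grassmann_distance(Vi: np.ndarray, Vj: np.ndarray) -> float:
    """Geodesic distance between span(Vi) and span(Vj); inputs are (d, r) orthonormal."""
    sigma = np.linalg.svd(Vi.T @ Vj, compute_uv=False)
    theta = np.arccos(np.clip(sigma, -1.0, 1.0))  # principal angles
    return float(np.linalg.norm(theta))

rng = np.random.default_rng(0)
d, r = 64, 4  # illustrative sizes only
# Rows of B_t span the input subspace, so take the column span of B_t^T
V1 = orthonormal_basis(rng.standard_normal((r, d)).T)
V2 = orthonormal_basis(rng.standard_normal((r, d)).T)
print(grassmann_distance(V1, V1))  # same subspace: close to 0
print(grassmann_distance(V1, V2))  # two random subspaces
```

The clipping step guards against singular values that land slightly above 1 through floating-point error, which would otherwise make `arccos` return NaN.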
-### c) Column Space and Row Space
-- **Column space** of $\Delta W_t = A_t B_t$: $\text{col}(\Delta W_t) = \text{span}(U_t)$ → the **output feature subspace** that task $t$ affects
-- **Row space** of $\Delta W_t$: $\text{row}(\Delta W_t) = \text{span}(V_t)$ → the **input feature subspace** that task $t$ uses
-- **Null space** of $\Delta W_t$: inputs that task $t$ **does not affect at all** → the orthogonal complement of the row space
-### d) Frobenius Norm and Spectral Properties
-$$\|A_t B_t\|_F = \sqrt{\sum_i (\sigma_i^t)^2}$$
-This measures the overall "magnitude" of task $t$'s adaptation. The distribution of singular values indicates:
-- **Concentrated** ($\sigma_1 \gg \sigma_2 \gg \ldots$): The task has a dominant direction → its knowledge is concentrated
-- **Spread** ($\sigma_1 \approx \sigma_2 \approx \ldots$): The task needs many directions → its knowledge is dispersed
-## 3.3 Suitable Statistical/Geometric Tools
-| Feature | Tool | Meaning |
-|-----------|---------|---------|
-| Subspace direction | Grassmann manifold, principal angles | Measures "task relatedness" from the angles between subspaces |
-| Singular value distribution | Spectral entropy, effective rank | Measures the "complexity/spread" of task knowledge |
-| Weight matrix geometry | Frobenius/Nuclear/Spectral norm | Measures the "magnitude" of the task adaptation |
-| Subspace overlap | $\text{Tr}(P_i P_j)$ with $P_i = V_i V_i^T$ the projection | Measures the overlap between operating subspaces |
-| Fisher Information | $F_t = \mathbb{E}[\nabla \log p \cdot \nabla \log p^T]$ | Parameter importance (but it needs data → a violation if old task data is used) |
-**Important note**: Every metric above except Fisher Information requires only the **matrices $A_t, B_t$** (frozen weights) and NO old data → **fully valid** in the zero-replay setting.
----
-# PART 4: THE ORTHOGONALITY PROBLEM (SUBSPACE EXHAUSTION)
-## 4.1 The Problem: Subspace Shrinkage (a Correct Observation)
-Your observation is **entirely correct** and is confirmed by both theory and code:
-### Mathematical argument:
-With GPM/DualGPM (InfLoRA), the old-task subspace $\mathcal{M}_t$ **grows monotonically**:
-$$\dim(\mathcal{M}_1) \leq \dim(\mathcal{M}_2) \leq \ldots \leq \dim(\mathcal{M}_T) \leq d_{in}$$
-Therefore the **null space $\mathcal{M}_t^{\perp}$ shrinks monotonically**:
-$$\dim(\mathcal{M}_t^{\perp}) = d_{in} - \dim(\mathcal{M}_t)$$
-As a result:
-- **Task 1**: The entire $d_{in}$-dimensional space is available → $B_1$ has $d_{in}$ dimensions to choose from
-- **Task $t$**: Only $\dim(\mathcal{M}_t^{\perp})$ dimensions remain → $B_t$ is confined to a smaller subspace
-- **Task $T$ (final)**: The available space can be very small if $T$ is large
-### From the GainLoRA code (InfLoRA variant):
-```python
-# The threshold grows with the task index, so the old-task subspace claims more directions
-threshold = (1.0 - threshold_base) * cur_task / total_sessions + threshold_base
-# With threshold_base = 0.995, the threshold rises from 0.995 toward 1.0
-```
-Observation from the InfLoRA paper (Figure 5): dim($\mathcal{M}_t^{\perp}$) decreases but is "always much larger than zero". **However**, this only holds for 20 tasks with $d_{in} = 768$ (ViT-B/16). In harder settings (T5-Large, $d_{in} = 1024$, 15 tasks, each task consuming many directions), the subspace can be **substantially depleted**.
-### Consequence: Unfair Capacity Allocation
-| Task | Available dim | Constraint count | Effective capacity |
-|------|--------------|-------------------|-------------------|
-| Task 1 | $d_{in}$ | 0 | Maximum |
-| Task 5 | $d_{in} - \sum_{i=1}^{4} k_i$ | 4 sets | Reduced |
-| Task 15 | $d_{in} - \sum_{i=1}^{14} k_i$ | 14 sets | **Very small** |
-Here $k_i$ is the number of dimensions added to $\mathcal{M}$ at each task (typically $k_i \sim$ the effective rank of task $i$).
-**Concrete example**: If each task claims 60 dimensions on average (with threshold 0.995), then by the time task 15 starts, the 14 earlier tasks have claimed:
-$$\text{claimed} = 14 \times 60 = 840 \quad \text{vs.} \quad d_{in} = 1024$$
-→ Task 15 has only $\sim 184$ dimensions available → **capacity drops by ~82%** relative to task 1.
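A minimal sketch of the bookkeeping behind this kind of estimate, assuming every task claims the same number of directions (60, the average quoted above) and that a task only sees what earlier tasks have already claimed. The function name `remaining_dims` is hypothetical and is not part of the GainLoRA code.

```python
def remaining_dims(d_in, claimed_per_task, task_index):
    """Free dimensions when task `task_index` (1-based) starts training."""
    claimed = claimed_per_task * (task_index - 1)  # only earlier tasks count
    return max(d_in - claimed, 0)


d_in = 1024
per_task = 60  # assumed average number of directions claimed per task

for t in (1, 5, 15, 16):
    free = remaining_dims(d_in, per_task, t)
    drop = 100 * (1 - free / d_in)
    print(f"task {t:2d}: {free:4d} free dims ({drop:.1f}% less than task 1)")
```

Under these assumptions the free space falls linearly with the task index; a real run would use the per-task $k_i$ values reported by GPM instead of a constant.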
-## 4.2 Approaches from the Literature
-### Approach 1: DualGPM with Slow Expansion (already used by InfLoRA)
-- Raise the threshold gradually → slows the expansion
-- **Drawback**: This only *slows* depletion and does not *address* the root cause. Trade-off: a high threshold preserves old tasks well but leaves a narrow space; a low threshold leaves a wide space but causes interference.
-### Approach 2: Adaptive Relaxation (already used by MINGLE)
-- Track the alignment history $h_i$ (EMA) between the gradient and old directions
-- Directions with high historical alignment → **relaxed** (updates allowed)
-- $\lambda_i = \exp(-\gamma \cdot h_i)$: soft decay instead of hard projection
-**Advantage**: No space is consumed permanently; directions needed by the current task are "borrowed" back.
-**Drawback**: Can cause interference if the relaxation is too strong.
-### Approach 3: Subspace Recycling / Forgetting Old Bases
-- Idea: if a direction in $\mathcal{M}_t$ is no longer important (for example, its singular value is very small), it can be "released" for new tasks.
-- **No paper has implemented this yet** in the LoRA CL context.
-- Related: "Memory-efficient GPM" directions, but nothing formal yet.
-### Approach 4: Shared Subspace Decomposition (Novel Direction)
-- Instead of hard orthogonality: decompose each task into a **shared component** + a **task-specific component**
-- The shared component is reused → consumes no new space
-- The task-specific component obeys orthogonality → but is much smaller
-- Related: **oblique projection** instead of orthogonal projection
-### Approach 5: Grassmann Manifold Optimization (Mathematical Foundation)
-Instead of projecting in Euclidean space, optimize on the **Grassmann manifold** $\text{Gr}(r, d)$:
-**Stiefel Manifold Constraint**: Instead of $B_t \perp \text{span}(\text{old bases})$, require:
-$$B_t \in \text{St}(r, d_{in}) \quad \text{(Stiefel manifold: orthonormal frames)}$$
-Then use **Riemannian gradient descent** on the Grassmannian to find the optimal $B_t$ on the manifold; this is inherently balanced because every point on the Grassmannian has equal "metric volume".
-**Mathematical connection**: Geodesic distance on the Grassmannian is determined by the principal angles, which are exactly the independence measure between subspaces. Optimizing on the manifold naturally balances capacity.
-## 4.3 Analysis of OLoRA (Soft Constraint)
-OLoRA uses a soft penalty $\|A_{old} A_{new}^T\|$ instead of hard projection. As a result:
-**Advantages**:
-- No subspace exhaustion (the penalty is flexible and allows small overlap)
-- Fairer capacity allocation (every task has the whole space but is penalized for overlap)
-**Drawbacks**:
-- No **theoretical guarantee** that interference = 0
-- The penalty strength $\lambda_1$ is fixed → not adaptive to task complexity
-- Can lead to "soft forgetting" if overlap accumulates
-## 4.4 Conclusion: A New Mechanism Is Needed
-Both hard (InfLoRA) and soft (OLoRA) constraints have significant drawbacks:
-1. **Hard**: Subspace exhaustion, unfair late-task capacity
-2. **Soft**: No guarantee, accumulating interference
-→ An **adaptive mechanism** that combines the advantages of both is needed.
----
-# PART 5: EVALUATING THE NEW DIRECTION AND PROPOSED IMPROVEMENTS
-## 5.1 Evaluating the New Idea (Module-Level Signatures + OT Routing)
-### Strengths:
-1. **Fully valid**: Analyzes only the frozen weights $A_t, B_t$ → zero-replay compliant
-2. **Novel**: NONE of the 109 surveyed papers uses LoRA weight SVD/spectral properties as routing signatures
-3. **Well-motivated**: The SVD of $A_t B_t$ captures task subspace geometry and is mathematically grounded on the Grassmann manifold
-4. **Compatible**: Can be applied to GainLoRA, MINGLE, and any expandable LoRA architecture
-### Points to improve:
-1. **What is OT routing based on?**: The cost matrix must be defined clearly. Old idea: vMF log-likelihood (violates zero-replay). New idea: **Grassmann distance** or **subspace projection similarity** between the input and the LoRA subspaces → valid.
-2. **Input representation**: Routing needs to know which "region" the input feature $x$ belongs to. We need to map $x$ into the same space as the LoRA signatures WITHOUT using old data. Solution: **project $x$ onto each LoRA subspace** and measure the "fit" by the projection magnitude.
-3. **Fairness constraint**: Subspace exhaustion must be addressed → this COULD be the second contribution (replacing the anti-invasion loss).
-## 5.2 Draft Idea Proposal: **SpecRoute: Spectral Signatures + Grassmann-Fair Routing**
-### Overview of the 3 Contributions
-| # | Contribution | Replaces what? | Novel? |
-|---|-------------|--------------|--------|
-| C1 | **Spectral LoRA Signatures**: Use the SVD properties $(U_t, \Sigma_t, V_t)$ of the frozen $A_t B_t$ as a task fingerprint | Replaces prompt_key (point estimate) with a rich spectral descriptor | ✅ Novel: no surveyed paper does this |
-| C2 | **Grassmann-OT Routing**: OT with cost = Grassmann distance between the input projection and the LoRA subspaces | Replaces cosine sim → sigmoid with principled OT | ✅ Novel: OT + Grassmann not yet combined in CL |
-| C3 | **Elastic Subspace Allocation (ESA)**: A mechanism replacing hard orthogonality that allows controlled sharing + spectral-importance-weighted protection | Replaces the GPM hard constraint with an adaptive elastic constraint | ✅ Novel: addresses a known limitation |
-### C1: Spectral LoRA Signatures
-**Definition**: For a trained task $t$ with frozen $A_t, B_t$, compute the SVD:
-$$\Delta W_t = A_t B_t = U_t \Sigma_t V_t^T$$
-The **signature** $\mathcal{S}_t$ consists of:
-1. **Subspace direction**: $V_t \in \mathbb{R}^{d_{in} \times r}$ (input receptive field)
-2. **Spectral profile**: $\sigma_t = (\sigma_1^t, \ldots, \sigma_r^t)$ (importance distribution)
-3. **(Optional)** Output direction: $U_t$, if output-level routing is needed
-**Storage**: Only $V_t$ (size $d_{in} \times r = 1024 \times 4 = 4096$ floats) + $\sigma_t$ ($r = 4$ floats) per layer per task. With 15 tasks × 48 attention layers (T5-Large, Q+V) this is 15 × 48 × 4100 ≈ 2.95M floats ≈ 11.8 MB at 4 bytes per float, which is **very small** compared with the model size.
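The storage figure can be checked with a few lines of arithmetic. This is only a sketch: it assumes float32 storage (4 bytes per value) and the layer and task counts quoted above, and the helper name `signature_floats` is made up for illustration.

```python
def signature_floats(d_in, rank, num_tasks, num_layers):
    """Number of floats needed to store V_t and sigma_t for every task and layer."""
    per_layer_per_task = d_in * rank + rank  # V_t entries plus singular values
    return per_layer_per_task * num_tasks * num_layers


total = signature_floats(d_in=1024, rank=4, num_tasks=15, num_layers=48)
megabytes = total * 4 / 1e6  # assumes float32, i.e. 4 bytes per float

print(f"{total} floats, about {megabytes:.1f} MB")
```

Storing $U_t$ as well (the optional output direction) would add another $d_{out} \times r$ floats per layer per task.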
-**Comparison with the current GainLoRA**:
-- `prompt_key` = 1 vector $\in \mathbb{R}^d$ per task (point estimate, learned jointly with gating)
-- Spectral signature = $r$ vectors + $r$ scalars per task per layer (captures subspace geometry, computed from frozen weights)
-**Why is it better?**
-- `prompt_key` encodes "which inputs belong to this task", but it is learned in its own feature space (trans_input), which causes the non-parallel experts problem (see the old proposal, Part 1.2)
-- The spectral signature encodes "which subspace this task operates on", taken directly from the weight geometry; it is objective and does not depend on a feature extractor
-### C2: Grassmann-OT Routing
-**Idea**: For an input $h \in \mathbb{R}^{d_{in}}$ at a layer, measure how well $h$ fits each LoRA subspace with the **projection ratio**:
-$$\text{fit}(h, \mathcal{S}_t) = \frac{\|V_t^T h\|^2}{\|h\|^2} \cdot \text{spectral\_weight}_t$$
-Where:
-- $\|V_t^T h\|^2 / \|h\|^2$ = fraction of $h$'s energy captured by task $t$'s subspace (= $\cos^2$ of the angle between $h$ and the subspace, i.e. the **projection magnitude**)
-- $\text{spectral\_weight}_t = \sum_i \sigma_i^t / \sum_j \sum_i \sigma_i^j$ = relative importance of task $t$
-**Cost matrix for OT**:
-$$C_{bt} = 1 - \text{fit}(h_b, \mathcal{S}_t) \quad \in [0, 1]$$
-(low cost = input fits well into task's subspace)
-**Sinkhorn OT**:
-$$\Pi^* = \text{Sinkhorn}(C, \varepsilon), \quad \text{weights} = B \cdot \Pi^* \quad \in \mathbb{R}^{B \times N_{tasks}}$$
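The routing pipeline above (projection fit → cost matrix → Sinkhorn → weights) can be sketched in plain Python. This is an illustration under stated assumptions, not the SpecRoute implementation: the spectral weight is set to 1, both marginals are uniform, the Sinkhorn loop is the plain (not log-domain) variant, and the two toy subspaces and four inputs are made up.

```python
import math


def projection_fit(h, basis):
    """Fraction of h's energy inside span(basis); basis vectors must be orthonormal."""
    energy = sum(x * x for x in h)
    captured = sum(sum(v_k * h_k for v_k, h_k in zip(v, h)) ** 2 for v in basis)
    return captured / energy


def sinkhorn(cost, eps=0.05, iters=200):
    """Entropic OT with uniform row and column marginals; returns the transport plan."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy setup (made up): two tasks in R^4 with axis-aligned orthonormal subspaces.
subspaces = [
    [[1, 0, 0, 0], [0, 1, 0, 0]],  # V_1
    [[0, 0, 1, 0], [0, 0, 0, 1]],  # V_2
]
batch = [
    [1.0, 0.5, 0.1, 0.0],
    [0.9, 0.2, 0.0, 0.1],
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 0.5, 1.0],
]

cost = [[1.0 - projection_fit(h, V) for V in subspaces] for h in batch]
plan = sinkhorn(cost)
weights = [[len(batch) * p for p in row] for row in plan]  # weights = B * Pi
routed = [max(range(len(row)), key=row.__getitem__) for row in weights]
print(routed)
```

Each row of `weights` sums to roughly 1, so it can be read as a soft assignment of one input over the tasks. Note that the uniform column marginal forces every task to receive the same total mass per batch, which is the load-balancing behaviour OT adds on top of the raw fit scores.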
-**Why OT rather than direct projection?**
-1. **Global balance**: OT ensures that experts are used sensibly (no collapse onto a single expert)
-2. **Principled**: Optimal transport has a solid theoretical foundation (Monge-Kantorovich)
-3. **Differentiable**: Sinkhorn has gradients → it can be fine-tuned if needed
-**Why is Grassmann distance suitable?**
-- The subspaces $\text{span}(V_t)$ lie on the Grassmann manifold → Grassmann distance is the natural metric
-- Up to the spectral weight, the projection-based "fit" is $\cos^2\theta$, where $\theta$ is the principal angle between $\text{span}(h)$ and $\text{span}(V_t)$, so it ranks tasks the same way that angle does
-### C3: Elastic Subspace Allocation (ESA): Replacing Hard Orthogonality
-**Problem**: Hard orthogonality (InfLoRA) → subspace exhaustion. Soft penalty (OLoRA) → no guarantee.
-**ESA solution**: Combine **importance-weighted protection** + **controlled sharing**
-**Step 1 (Spectral Importance Scoring)**: For each old task $t'$ at each layer, compute an importance score for each direction $v_i^{t'}$:
-$$w_i^{t'} = \frac{(\sigma_i^{t'})^2}{\sum_j (\sigma_j^{t'})^2}$$
-Directions with a high singular value → crucial for task $t'$ → need strong protection.
-**Step 2 (Weighted Projection)**: Instead of a hard projection out of the whole $\mathcal{M}_t$:
-$$B_t \leftarrow B_t - \sum_{t'<t} \sum_{i=1}^{r} \alpha_i^{t'} \, B_t \, v_i^{t'} (v_i^{t'})^T$$
-With:
-$$\alpha_i^{t'} = \begin{cases} 1 & \text{if } w_i^{t'} > \tau_{\text{protect}} \quad \text{(hard protect critical directions)} \\ w_i^{t'} & \text{if } w_i^{t'} \leq \tau_{\text{protect}} \quad \text{(soft protect less important)} \end{cases}$$
-**Step 3 (Space Budget)**: Limit the total number of protected dimensions:
-$$\sum_{t'<t} \text{effective\_rank}(t') \leq \beta \cdot d_{in}$$
-If the budget is exceeded → **prune** the directions with the lowest $\sigma_i^{t'}$ first (subspace recycling).
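A small sketch of the importance scoring and weighted projection steps, written for a single row of $B_t$ and a single old task. It assumes the old directions are orthonormal and treats each row of $B_t$ as a vector in input space; the function names and the toy numbers are hypothetical, and the budget and pruning step is not shown.

```python
def spectral_importance(sigmas):
    """w_i = sigma_i^2 / sum_j sigma_j^2."""
    total = sum(s * s for s in sigmas)
    return [s * s / total for s in sigmas]


def protection_coeffs(weights, tau):
    """alpha_i = 1 for critical directions (w_i > tau), otherwise w_i."""
    return [1.0 if w > tau else w for w in weights]


def elastic_project(b_row, directions, alphas):
    """Remove the alpha-weighted component of b_row along each old direction."""
    out = list(b_row)
    for v, alpha in zip(directions, alphas):
        coeff = alpha * sum(x * y for x, y in zip(b_row, v))
        out = [o - coeff * y for o, y in zip(out, v)]
    return out


# Toy example (made up): one old task with rank 2 in R^3.
sigmas = [3.0, 1.0]
directions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

w = spectral_importance(sigmas)         # importance of each old direction
alphas = protection_coeffs(w, tau=0.5)  # tau = 1/r with r = 2
new_row = elastic_project([2.0, 2.0, 2.0], directions, alphas)
print(w, alphas, new_row)
```

In this toy case the dominant direction is removed completely from the new row, the minor direction is only damped, and the component outside the old subspace is left untouched.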
-**Advantages**:
-- **Fair**: Critical directions always protected, minor directions can be shared
-- **Efficient**: Total protected space bounded by $\beta \cdot d_{in}$
-- **Adaptive**: Importance changes per task — complex tasks claim more, simple tasks claim less
-- **Theoretically grounded**: Spectral importance = proxy for output sensitivity ($\sigma_i$ reflects how much direction $i$ affects output)
-**Comparison**:
-| Method | Protection | Space usage | Fairness | Guarantee |
-|-------------|-----------|-------------|----------|-----------|
-| InfLoRA (GPM) | Hard, all directions | Monotonic increase | Unfair (first-come) | Strong for protected |
-| OLoRA | Soft penalty | Constant | Fair | Weak |
-| MINGLE (adaptive relax) | EMA-adaptive | Controlled | Medium | Medium |
-| **ESA (proposed)** | Importance-weighted | Bounded by budget | **Fair** | Strong for critical, soft for minor |
----
-# PART 6: NOVELTY CHECK OF THE NEW IDEA
-## 6.1 Cross-check against the 109 Papers + Additional Papers
-### C1: Spectral LoRA Signatures for Routing
-| Paper | Use of spectral information | Difference |
-|-------|-------------------|-----------|
-| **MINGLE** | SVD of merged task vector → entropy-based effective rank → null-space exclusion | SVD is used for **construction** (building the LoRA), NOT as a routing signature |
-| **SD-LoRA** (ICLR'25) | Decouple magnitude + direction | Analysis purpose, not routing |
-| **Grassmannian MoE** (arXiv) | Bingham on the Grassmannian | Routing entropy control, NOT a knowledge signature. Also not CL. |
-| **Feature Distributions** (ICML'25) | Mean feature vector | Feature-level, not weight-level |
-**Conclusion C1**: ✅ **Novel**: none of the surveyed papers uses the SVD properties ($V_t, \Sigma_t$) of frozen LoRA weights as routing signatures in CL.
-### C2: OT Routing Based on Grassmann Distance
-| Paper | OT usage | Routing basis | Difference |
-|-------|---------|--------------|-----------|
-| **BASE Layers** (ICML'21) | OT load-balancing | Learned scores | OT for balance, not knowledge matching |
-| **Selective Sinkhorn** (2025) | OT routing | Learned scores | OT for routing, but the cost is learned, not geometric |
-| **SCDEM** (2025) | OT feature alignment | Feature distance | OT for alignment, not routing |
-**Conclusion C2**: ✅ **Novel**: OT + subspace projection cost (Grassmann-based) has not been used for CL routing in the surveyed papers.
-### C3: Elastic Subspace Allocation
-| Paper | Subspace management | Difference |
-|-------|-------------------|-----------|
-| **InfLoRA** | Hard GPM, threshold-based | No recycling, no importance weighting |
-| **DualGPM** | Bi-directional, threshold-based | Slightly better but same root issue |
-| **MINGLE** | Adaptive relaxation (EMA) | Gate-level, not LoRA subspace level |
-| **TRGP** (Lin et al., ICLR'22) | Trust region gradient projection | Relaxes constraint based on "trust" but no spectral importance |
-**Conclusion C3**: ✅ **Novel**: importance-weighted subspace protection with a bounded budget has not been proposed in the surveyed papers.
-## 6.2 Overall Assessment
-| Criterion | Assessment |
-|----------|---------|
-| Novelty | ✅ High: all 3 contributions are novel |
-| Zero-replay compliance | ✅ Full: uses only frozen weights |
-| Mathematical rigor | ✅ Grassmann geometry, SVD, OT: all well-established |
-| Practical feasibility | ✅ SVD of $(r \times d)$ matrices is very fast (r=4) |
-| Compatibility | ✅ Applicable to GainLoRA, InfLoRA+GainLoRA, MINGLE |
-| Theoretical backing | ✅ Grassmann manifold (Edelman et al.), OT (Villani), Spectral theory |
----
-# PART 7: CONSOLIDATED DRAFT IDEA
-## SpecRoute: Spectral-Geometric Routing for Fair Continual LoRA Learning
-### Motivation (1 paragraph)
-In LoRA-based continual learning, two challenges remain unresolved: (1) current routing mechanisms rely on learned point estimates (cosine similarity to prompt keys), which do not capture the geometric structure of task knowledge subspaces and lead to suboptimal assignment, especially for inputs on the boundary between tasks; (2) orthogonal constraints (GPM/DualGPM) ensure non-interference but cause subspace exhaustion: later tasks are unfairly limited in capacity compared with earlier tasks, which degrades overall performance. We observe that the frozen LoRA weights $(A_t, B_t)$ carry full geometric information about each task's "operating region" through the SVD, and that this information can be exploited as task signatures for principled routing.
-### Method Overview
-**1. Spectral LoRA Signatures (Section 3.1)**
-- After training task $t$, compute the SVD: $A_t B_t = U_t \Sigma_t V_t^T$
-- Signature $\mathcal{S}_t = (V_t, \Sigma_t)$ per layer, encoding the operating subspace + importance profile
-- No old data is needed, and no extra computation beyond the SVD (very fast for r=4)
-**2. Grassmann-OT Routing (Section 3.2)**
-- Input $h$ → compute projection fit: $\text{fit}(h, \mathcal{S}_t) = \sum_i \sigma_i^t \cdot (v_i^t \cdot h)^2 / \|h\|^2$
-- Build cost matrix $C_{bt} = 1 - \text{normalized\_fit}$ per batch
-- Sinkhorn OT → globally optimal routing weights
-- Fully replaces cosine-sigmoid gating → removes the non-parallel feature space problem
-**3. Elastic Subspace Allocation (Section 3.3)**
-- Weight each old direction by its spectral importance $w_i^{t'} = (\sigma_i^{t'})^2 / \sum_j (\sigma_j^{t'})^2$
-- Hard protect critical directions ($w > \tau$), soft protect minor directions
-- Bounded total protected dimensions → **fair capacity** for late tasks
-- Optional: subspace recycling when the budget is exceeded
-### Theoretical Justification
-1. **Proposition 1** (inherited from InfLoRA): Fine-tuning $A_t$ = fine-tuning $W$ in span($B_t$) → SVD of $A_t B_t$ fully characterizes task's operating subspace
-2. **Grassmann distance** between subspaces = principal angles = natural metric for "task relatedness"
-3. **OT guarantees**: Sinkhorn produces $\varepsilon$-approximate optimal transport plan → globally balanced assignment
-4. **ESA bound**: Total protected capacity ≤ $\beta \cdot d_{in}$ → late tasks guaranteed ≥ $(1-\beta) \cdot d_{in}$ available directions
-### Expected Contributions Claim
-- **C1**: First to use spectral properties of frozen LoRA weights as routing signatures in CL
-- **C2**: First to combine Grassmann subspace distance with OT for routing in CL
-- **C3**: First to address LoRA subspace exhaustion via importance-weighted elastic allocation
-### Application to GainLoRA
-1. Replace `prompt_key` + `trans_input` + `previous_trans_input` with spectral signatures + projection routing
-2. Replace the GPM hard constraint with ESA
-3. Keep: expandable LoRA architecture, training loss, frozen old branches
-### Potential Risks & Mitigations
-| Risk | Severity | Mitigation |
-|------|---------|------------|
-| SVD per-layer overhead | Low | $r=4$ → SVD trivial; compute once after training |
-| Projection fit not discriminative enough | Medium | Add spectral weighting $\sigma_i$ to amplify important directions |
-| OT Sinkhorn convergence | Low | Log-domain Sinkhorn with $\varepsilon=0.05$, well-studied |
-| ESA τ threshold sensitivity | Medium | Cross-validate; default $\tau = 1/r$ (uniform importance threshold) |
-| Compatibility with GainLoRA gating constraints | Medium | ESA replaces GPM entirely; GainLoRA gating becomes unnecessary (routing handles expert selection) |
----
-# PART 8: SUMMARY
-## Key conclusions:
-1. **Violation confirmed**: The old idea (vMF data signatures + anti-invasion loss) does violate the zero-replay setting. Pivoting to exploiting LoRA weights is a valid direction.
-2. **The subspace exhaustion observation is correct**: Hard orthogonal constraints (GPM) cause unfair capacity allocation for late tasks. This was confirmed through mathematical analysis and code. It is an open problem that nobody has fully solved.
-3. **LoRA features are rich**: The SVD of $A_t B_t$ provides rich geometric information: subspace directions, importance profile, effective rank. These live on the Grassmann manifold, which has a natural metric topology.
-4. **The new idea (SpecRoute) is viable**: The 3 contributions (spectral signatures, Grassmann-OT routing, elastic subspace allocation) are all novel, valid, mathematically grounded, and applicable to the GainLoRA/MINGLE platform.
-5. **Papers with the same settings**: GainLoRA, InfLoRA, O-LoRA, C-LoRA, MINGLE, TreeLoRA, PLAN, Feature Distributions, SD-LoRA all follow zero-replay + LoRA expansion. NONE of them combines weight-level spectral signatures + OT routing + elastic capacity allocation.

human_working_IdeaMethod_and_discuss/settings.txt DELETED Viewed

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:b89ff4e5fc7f5b76386748a61dc1ba506edc5f8b4aa07a4388e25222879c19b8
-size 750

human_working_IdeaMethod_and_discuss/simple_idea.txt DELETED Viewed

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:cc880fdc09869b80054a5ed4209abd437ada0a86d41d84a7fd8d1a8c1d8ab0a8
-size 1304

human_working_IdeaMethod_and_discuss/work_ethic.txt DELETED Viewed

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:ea980e446e8d9233b648faacc50ff09e03ca139b8c29cdcd1a04bb8c2d8fcc92
-size 2204

human_working_IdeaMethod_and_discuss/working_method.txt DELETED Viewed

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:f4b3c616553676bae9a05c228705573d40318029a81edde6b22422dfe0784273
-size 232

improve_gainlora/v6_discuss.md ADDED Viewed

	@@ -0,0 +1,142 @@

+# Revised Implementation Plan — After Strict Zero-Replay Constraint
+> **Trigger**: User confirmed that storing mean embeddings (prototypes) violates zero-replay: *"storing anything derived from the data is a violation"*
+---
+## I. CONSTRAINT ANALYSIS — What's Available?
+Under strict zero-replay, at test time we have **ONLY**:
+| Information | Source | Available? |
+|------------|--------|-----------|
+| Frozen LoRA weights $A_t, B_t$ | Model training artifact | ✅ |
+| SVD of $\Delta W_t = B_t A_t$ | Derived from model params | ✅ |
+| Current input $h$ | Test sample | ✅ |
+| GPM bases | ROOT method (forward on data) | ✅ Already stored |
+| **Mean embeddings** $\mu_t$ | **Data statistic** | ❌ **VIOLATES** |
+| **Distribution params** | **Data statistic** | ❌ **VIOLATES** |
+| **Learned routing params** | Model params | ⚠️ Legal but has forgetting risk |
+### Implication
+Prototype routing (V5) is **invalid**. Available routing must use only:
+$$\text{routing}(h) = f(h; \{A_t, B_t\}_{t=1}^T)$$
+This means spectral routing is the **ONLY** parameter-free, zero-replay-compliant routing mechanism.
+---
+## II. REVISITING THE GPM-ROUTING PARADOX
+The paradox: GPM forces $A_{k'} \perp A_k$, so spectral fit favors the first expert that claimed shared input directions.
+**But is this paradox insurmountable?** Let me re-examine:
+### Severity depends on expert quality
+If expert $t$'s LoRA is **well-trained** (captures task-specific information in its 4D subspace), then even though its subspace is orthogonal to earlier experts, the SVD singular values $\sigma_t$ encode **how strongly** the expert responds to different directions. Two same-domain experts with orthogonal $V_t$ can still be discriminated IF:
+$$\sigma_t^2 (v_t^T h)^2 \gg \sigma_{t'}^2 (v_{t'}^T h)^2$$
+This happens when:
+1. Expert $t$ has **large** singular values → strong response in its subspace
+2. Input $h$ projects **significantly** onto expert $t$'s orthogonal directions
+### C4 directly addresses this
+**Preconditioned Gradients** (gradient preconditioning via $(AA^T + \epsilon I)^{-1/2}$):
+- Stabilizes training in the constrained (null-space projected) subspace
+- Expert learns more effectively **within** its allocated subspace
+- → Better singular value spectrum → more discriminative spectral signatures
+**Spectral Entropy Regularization** ($\lambda \cdot (H_{max} - H(\hat{\sigma}))$):
+- Encourages LoRA to utilize **full rank** (all 8 dimensions, not just 1-2)
+- More spread singular values → expert responds to more directions in its subspace
+- → Higher projection energy for correct inputs → better routing discrimination
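The entropy term can be illustrated on a plain list of singular values. This is a hedged sketch, not the trainer's `_compute_spectral_entropy_loss()` (which, per the notes below, uses a QR trick): it assumes $\hat{\sigma}$ is the singular-value vector normalized to sum to 1 and that $H_{max} = \log r$.

```python
import math


def spectral_entropy_penalty(sigmas, lam=0.01):
    """lam * (H_max - H(sigma_hat)) with sigma_hat = sigmas / sum(sigmas)."""
    total = sum(sigmas)
    probs = [s / total for s in sigmas]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    h_max = math.log(len(sigmas))  # entropy of a perfectly flat spectrum
    return lam * (h_max - entropy)


flat = spectral_entropy_penalty([1.0, 1.0, 1.0, 1.0])
spread = spectral_entropy_penalty([3.0, 2.0, 2.0, 1.0])
peaked = spectral_entropy_penalty([10.0, 0.1, 0.1, 0.1])
print(flat, spread, peaked)
```

The penalty is zero for a flat spectrum and grows as the spectrum collapses onto one direction, which is the behaviour the regularizer relies on to push the LoRA toward using its full rank.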
+### Hypothesis: C4 is the key to making spectral routing work
+The V1-V3 failures may not be purely due to the routing MECHANISM, but also due to **poor expert quality** — LoRA trained in constrained null-space without preconditioning learns poorly, producing weak/degenerate singular values → spectral routing becomes random.
+**C4 fixes the expert quality → spectral routing becomes discriminative → performance improves.**
+---
+## III. PROPOSED APPROACH — Enhanced Spectral Routing + C4
+| Component | Choice | Rationale |
+|-----------|--------|-----------|
+| Routing | Spectral (SVD projection fit) | Only zero-replay-compliant option |
+| Training bias | Adaptive β(n) | Handle cold-start, softmax dilution |
+| Inference routing | Symmetric SVD (V3) | No bias, all tasks use same formula |
+| C4: Preconditioning | **Enabled** | Key to making null-space LoRA learn effectively |
+| C4: Spectral Entropy | **Enabled** (λ=0.01) | Full-rank LoRA → more discriminative signatures |
+| Protection | GPM on LoRA-A | Unchanged from ROOT |
+### Key insight
+**V4 crashed due to bugs, not because C4 is wrong.** The trainer code NOW has the fix:
+- [precompute_preconditioners()](file:///Users/nnminh322/Desktop/personal/Continual/improve_gainlora/src/cl_trainer_specroute.py#122-139) uses proper `torch.linalg.eigh()` + clamping
+- [_compute_spectral_entropy_loss()](file:///Users/nnminh322/Desktop/personal/Continual/improve_gainlora/src/cl_trainer_specroute.py#140-164) uses QR trick (avoids large SVD)
+- [_apply_preconditioning()](file:///Users/nnminh322/Desktop/personal/Continual/improve_gainlora/src/cl_trainer_specroute.py#165-178) includes `nan_to_num_()` guard
+### What changed from V5 script
+The V5 script ALREADY has C4 enabled (`lambda_entropy=0.01`, `use_preconditioning=True`). But V5 also has prototype routing at inference. We need:
+1. **Keep C4 enabled** (same as V5)
+2. **Disable prototype routing at inference** → fall back to symmetric SVD routing (V3)
+---
+## IV. IMPLEMENTATION
+### [MODIFY] [t5_specroute.py](file:///Users/nnminh322/Desktop/personal/Continual/improve_gainlora/src/t5_specroute.py)
+In [compute_spectral_routing()](file:///Users/nnminh322/Desktop/personal/Continual/improve_gainlora/src/t5_specroute.py#231-382), the inference path currently:
+1. Tries prototype routing first (cosine similarity to stored prototypes)
+2. Falls back to SVD spectral routing if prototypes unavailable
+**Change**: Always use SVD spectral routing at inference (skip prototype check).
+Specifically: remove or bypass the `_use_proto` path at inference, always go to the SVD-based path (the `else` branch at line 317).
+### [MODIFY] Shell Script
+Create V6 script: keep C4 enabled, output paths renamed to V6.
+- `--lambda_entropy 0.01` ✅ (keep)
+- `--use_preconditioning True` ✅ (keep)
+- All paths renamed from v5 → v6
+### No other code changes needed
+- [run_t5.py](file:///Users/nnminh322/Desktop/personal/Continual/improve_gainlora/src/run_t5.py): prototype load code safely handles missing prototype files
+- [cl_trainer_specroute.py](file:///Users/nnminh322/Desktop/personal/Continual/improve_gainlora/src/cl_trainer_specroute.py): C4 code already correct
+---
+## V. VERIFICATION PLAN
+### Primary: Run full 15-task experiment
+```bash
+bash T5_small/gen_script_long_order3_t5_small_specroute_v6.sh <model_path>
+python score.py gen_script_long_order3_t5_small_specroute_v6 <output_path>
+```
+### Diagnostic: Monitor routing quality
+- Watch for `[SpecRoute]` log lines showing routing weight distributions
+- Check if C4 entropy loss decreases (indicates LoRA using more of its rank)
+- Check singular value spectra in `spectral_signatures.pt` files
+### Expected outcomes
+- **If C4 helps**: AP 45-55 (large improvement from V3's 33.77)
+- **If C4 doesn't help**: AP ~33-35 (similar to V3) → spectral routing fundamentally limited under strict zero-replay
+- **If C4 hurts** (unstable): Training loss NaN or spike → reduce λ_entropy or disable preconditioning
+---
+## User Review Required
+> [!IMPORTANT]
+> **Core question**: Given strict zero-replay, spectral routing is the ONLY zero-replay-compliant routing mechanism. V1-V3 showed it fails. The hypothesis is that C4 (preconditioned training + spectral entropy) will make experts learn better → spectral routing becomes discriminative. This is the last viable axis before we must accept learned routing (GainLoRA-style) as necessary.
+>
+> **Do you agree with proceeding with C4-enhanced spectral routing (no prototypes)? Or do you prefer a different direction?**

results/benchmark_explanation.md DELETED Viewed

@@ -1,219 +0,0 @@
-# Benchmark & Metrics Reference — GainLoRA Paper
-> This document explains the datasets, the metrics, and how to read the results in the GainLoRA paper.
-> Target audience: readers who already know continual learning but need a quick grasp of the notation.
----
-## 1. Datasets
-### SuperNI Benchmark (Orders 1 & 2)
-**Source:** Super-Natural Instructions (SuperNI), 1600+ NLP tasks from many domains.
-15 tasks are selected to represent different kinds of NLP tasks:
-| # | Task Name | Type | Metric |
-|---|-----------|------|--------|
-| 1 | task1572_samsum_summary | Summarization | RougeL |
-| 2 | task363_sst2_polarity_classification | Sentiment | RougeL |
-| 3 | task1290_xsum_summarization | Summarization | RougeL |
-| 4 | task181_outcome_extraction | Information Extraction | RougeL |
-| 5 | task002_quoref_answer_generation | QA | RougeL |
-| 6 | task1510_evalution_relation_extraction | Relation Extraction | RougeL |
-| 7 | task639_multi_woz_user_utterance_generation | Dialogue Generation | RougeL |
-| 8 | task1729_personachat_generate_next | Dialogue Generation | RougeL |
-| 9 | task073_commonsenseqa_answer_generation | Commonsense QA | RougeL |
-| 10 | task1590_diplomacy_text_generation | Text Generation | RougeL |
-| 11 | task748_glucose_reverse_cause_event_detection | Causal Reasoning | RougeL |
-| 12 | task511_reddit_tifu_long_text_summarization | Summarization | RougeL |
-| 13 | task591_sciq_answer_generation | Science QA | RougeL |
-| 14 | task1687_sentiment140_classification | Sentiment | RougeL |
-| 15 | task875_emotion_classification | Emotion | RougeL |
-**Characteristics:** The tasks are diverse in type (generation, classification, extraction) → it is hard to retain many skills at once. The metric is RougeL (an F-measure based on the longest common subsequence), scale 0–100.
-**Task orders:**
-- **Order 1**: Sequential from summarization → classification (tasks arranged by domain heterogeneity)
-- **Order 2**: Random shuffle to test independence from the ordering
-### Long Benchmark (Orders 3 & 4)
-**Source:** Popular NLP benchmarks (GLUE-style + text classification).
-15 tasks:
-| # | Task | Type | Metric |
-|---|------|------|--------|
-| 1 | yelp | Sentiment (fine-grained) | Exact Match |
-| 2 | amazon | Product sentiment | Exact Match |
-| 3 | mnli | NLI | Exact Match |
-| 4 | cb | NLI (FewShot) | Exact Match |
-| 5 | copa | Causal reasoning | Exact Match |
-| 6 | qqp | Paraphrase detection | Exact Match |
-| 7 | rte | Textual entailment | Exact Match |
-| 8 | imdb | Sentiment binary | Exact Match |
-| 9 | sst2 | Sentiment binary | Exact Match |
-| 10 | dbpedia | Topic classification | Exact Match |
-| 11 | agnews | News topic | Exact Match |
-| 12 | yahoo | Topic QA | Exact Match |
-| 13 | multirc | Multi-sentence RC | Exact Match |
-| 14 | boolq | Boolean QA | Exact Match |
-| 15 | wic | Word sense disambiguation | Exact Match |
-**Characteristics:** Many classification tasks → competition between class distributions. The metric is Exact Match (%), scale 0–100. Baseline results are higher than on SuperNI (the tasks are less diverse → less catastrophic forgetting).
-**Task orders:**
-- **Order 3**: yelp → wic (ordered from classification → understanding)
-- **Order 4**: mnli → yahoo (scrambled)
----
-## 2. Metrics: AP and FT
-### AP (Average Performance) ↑
-$$AP = \frac{1}{T} \sum_{j=1}^{T} R_{T,j}$$
-- $R_{T,j}$ = score on task $j$ **after finishing task $T$ (the last task)**
-- This is the **most important** number: it reflects the ability to "remember everything" after CL
-- **Higher = better**
-- The baseline runs 100 epochs (A100) while OT-SIGN runs 50 epochs (T4) → our AP may be lower than the baseline because of fewer epochs, not because the method is worse
-### FT (Forgetting) ↓
-$$FT = \frac{1}{T-1} \sum_{j=1}^{T-1} \left( R_{j,j} - R_{T,j} \right)$$
-- $R_{j,j}$ = score on task $j$ **right after training on task $j$** (peak performance)
-- $R_{T,j}$ = score on task $j$ **after training on all tasks** (final performance)
-- FT = **average drop in score** after learning additional tasks
-- **Lower = better** (less forgetting)
-- FT = 0 means the model fully remembers all tasks after CL
-### Result Matrix R
-```
-         Task1  Task2  Task3  ...  Task15
-Train1  [ R11    -      -           -   ]   ← only the task just learned is evaluated
-Train2  [ R21   R22     -           -   ]
-Train3  [ R31   R32    R33          -   ]
-...
-Train15 [ R151  R152   R153  ...  R1515 ]
-         ↑                         ↑
-       Final                    Final
-       perf                     perf
-       task1                    task15
-Diagonal: R_jj = peak performance (right after training each task)
-Last row: R_15j = final performance after CL
-AP = mean(last row)
-FT = mean(diagonal - last row), j=1..14
-```
----
-## 3. Baseline analysis: where does GainLoRA stand?
-### Table 1 (SuperNI, T5-Large)
-**Best before GainLoRA:**
-- InfLoRA: AP=39.78, FT=7.64 (Order 1)
-- GainLoRA (InfLoRA): AP=**46.21**, FT=**2.40**, an improvement of ~6.4 AP and a roughly 3.2× lower FT
-**GainLoRA strengths:**
-- Very low FT (2.40 vs 7.64 for InfLoRA) → far less forgetting
-- Higher AP than every other method, including LFPT5 (39.03)
-**Notes on the final stage:**
-- The final stage (task 15) matters most in deployment
-- Low FT for GainLoRA → the model still performs well on tasks 1-14 while handling task 15
-- This is the main weakness of older methods: O-LoRA's FT=19.15 means each task loses ~19 points on average
-### Table 2 (Long, T5-Large)
-**Context:**
-- The Long benchmark is easier → higher AP (70-80% range vs 40-46% range)
-- GainLoRA still outperforms InfLoRA: AP=78.01 vs 75.15 (Order 3)
-- Very low FT: 0.77 (almost no forgetting)
----
-## 4. OT-SIGN: what does it improve over the GainLoRA baseline?
-| Component | GainLoRA | OT-SIGN+GainLoRA |
-|-----------|----------|------------------|
-| Expert routing | Cosine similarity + sigmoid gating | Sinkhorn OT with vMF signature |
-| Continual protection | GPM gradient projection | + Anti-drift (MSE on trans_input) |
-| Expert invasion | None | + Anti-invasion hinge loss |
-| Knowledge repr | Prompt key vector | vMF distribution (mu, kappa) |
-### Expectations:
-- **FT should be lower than GainLoRA** (provided anti-drift and anti-invasion work as intended)
-- **AP comparable or higher** (provided OT routing is more accurate than cosine)
-- If AP is noticeably lower → this may be due to fewer epochs (50 vs 100) rather than a weaker method
----
-## 5. How to read the logs and compute_ap_ft.py
-### After each task finishes, the terminal prints:
-```
-[RunLogger] After task 3 (task1290_xsum_summarization) — predict scores:
-  task1572_samsum_summary                         45.23
-  task363_sst2_polarity_classification            67.81
-  task1290_xsum_summarization                     52.14
-```
-### After the last task (task 15), the terminal prints AP/FT automatically:
-```
-════════════════════════════════════════════════════════════════════════
-  OT-SIGN+GainLoRA Order1
-════════════════════════════════════════════════════════════════════════
-  #   Task                                             Peak   Final    Drop
-────────────────────────────────────────────────────────────────────
-  1   task1572_samsum_summary                         48.12   44.30   3.82
-  ...
-  15  task875_emotion_classification                  71.20   71.20   0.00
-────────────────────────────────────────────────────────────────────
-  AP  =  47.83   │   FT =  2.15
-════════════════════════════════════════════════════════════════════════
-```
-### Running manually after completion:
-```bash
-python src/compute_ap_ft.py \
-  --output_base logs_and_outputs/ot_sign_order1_t5large/outputs \
-  --task_order "task1572_samsum_summary,..." \
-  --method_name "OT-SIGN+GainLoRA Order1" \
-  --save
-```
-Results are saved to `logs_and_outputs/ot_sign_order1_t5large/ap_ft_result.json`.
----
-## 6. Estimated runtime on 2×T4
-| Script | Order | Benchmark | Epochs | Tasks | Estimate |
-|--------|-------|-----------|--------|-------|----------|
-| run_ot_sign_order1_t5large.sh | 1 | SuperNI | 50 | 15 | ~9-10h |
-| run_ot_sign_order2_t5large.sh | 2 | SuperNI | 50 | 15 | ~9-10h |
-| run_ot_sign_order3_t5large.sh | 3 | Long | 10 | 15 | ~3-4h |
-| run_ot_sign_order4_t5large.sh | 4 | Long | 10 | 15 | ~3-4h |
-**In parallel:** Orders 1+3 on GPUs 0,1 and Orders 2+4 on GPUs 0,1 (two separate servers)
-**Sequentially:** ~24-28h total (the sum of the estimates above); running Orders 3+4 first is recommended (faster, and it validates the pipeline)
----
-## 7. FAQ
-**Q: What is the rougeL range?**
-A: 0–100 after scaling (compute_metrics returns 0-1 and the code multiplies by 100). On SuperNI, the typical range is 20-80.
-**Q: GainLoRA runs 100 epochs on A100 while we run 50 epochs on T4. Is that fair?**
-A: Not entirely fair in absolute terms. The goal is to see the delta: if OT-SIGN+GainLoRA (50ep/T4) > GainLoRA (50ep/T4), the contribution is clear. To compare against the paper, both must be run with the same setup.
-**Q: What does a negative FT mean?**
-A: The final score is higher than the peak → the model "improves" on old tasks by learning new ones (positive transfer). This is rare but can happen with related tasks.
-**Q: AP is lower than the baseline even though FT is better?**
-A: Possibly because per-task peak performance is low → R[j,j] is low → the final row is low as well. That would mean OT routing may cause interference while each task is first being learned. Check the per-task peak scores to diagnose.
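The AP and FT definitions from Section 2 of this file can be sketched in a few lines of plain Python. This is an illustrative helper, not the project's `src/compute_ap_ft.py`; the function name `compute_ap_ft` and the 3-task score matrix are hypothetical.

```python
def compute_ap_ft(R):
    """R[i][j] = score on task j after training on task i (0-indexed, i >= j).

    Entries above the diagonal are never read, so they may be None.
    """
    T = len(R)
    final = R[T - 1]                      # last row: final performance
    ap = sum(final) / T                   # AP = mean of the last row
    if T < 2:
        return ap, 0.0
    # FT = mean over the first T-1 tasks of (peak - final)
    ft = sum(R[j][j] - final[j] for j in range(T - 1)) / (T - 1)
    return ap, ft


# Hypothetical 3-task score matrix (scores on a 0-100 scale).
R = [
    [80.0, None, None],
    [70.0, 60.0, None],
    [65.0, 55.0, 90.0],
]
ap, ft = compute_ap_ft(R)
print(f"AP = {ap:.2f}, FT = {ft:.2f}")    # AP = 70.00, FT = 10.00
```

A negative `ft` from this helper corresponds to the positive-transfer case discussed in the FAQ.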

results/comparison_results.md DELETED

@@ -1,257 +0,0 @@
-# GainLoRA vs OT-SIGN Results — Direct Comparison
-> **How to read:** AP↑ (higher is better) | FT↓ (lower is better)
-> All numbers in the tables are **rougeL (%)** for superni / **exact_match (%)** for long.
-> Results are taken from `compute_ap_ft.py` after each full 15-task run.
----
-## Table 1: T5-Large — SuperNI Benchmark (Orders 1 & 2)
-| Method | Order 1 AP↑ | Order 1 FT↓ | Order 2 AP↑ | Order 2 FT↓ |
-|--------|-------------|-------------|-------------|-------------|
-| LFPT5* | 39.03 | 10.87 | 29.70 | 20.72 |
-| EWC | 15.32 | 26.78 | 18.19 | 30.28 |
-| TaSL | 27.51 | 18.53 | 28.05 | 17.39 |
-| KIFLoRA | 28.33 | 16.44 | 30.31 | 16.27 |
-| SeqLoRA | 7.30 | 47.60 | 7.03 | 47.97 |
-| IncLoRA | 12.33 | 41.93 | 16.65 | 36.56 |
-| C-LoRA | 22.69 | 24.25 | 32.81 | 11.60 |
-| O-LoRA | 26.37 | 19.15 | 32.83 | 11.99 |
-| InfLoRA | 39.78 | 7.64 | 39.57 | 8.93 |
-| **GainLoRA (InfLoRA)** | **46.21** | **2.40** | **46.44** | **2.61** |
-| **OT-SIGN+GainLoRA (ours)** | | | | |
-## Table 2: T5-Large — Long Benchmark (Orders 3 & 4)
-| Method | Order 3 AP↑ | Order 3 FT↓ | Order 4 AP↑ | Order 4 FT↓ |
-|--------|-------------|-------------|-------------|-------------|
-| EPI* | — | — | 75.19 | 0.77 |
-| MIGU+FT | — | — | 71.30 | 11.39 |
-| EWC | 43.24 | 23.66 | 46.25 | 32.90 |
-| TaSL | 71.37 | 6.20 | 73.11 | 6.52 |
-| KIFLoRA | 72.19 | 3.10 | 73.72 | 4.75 |
-| SeqLoRA | 49.46 | 27.60 | 33.81 | 45.53 |
-| IncLoRA | 61.19 | 13.63 | 62.46 | 15.92 |
-| C-LoRA | 66.83 | 8.64 | 61.86 | 14.18 |
-| O-LoRA | 70.98 | 3.69 | 71.21 | 4.03 |
-| InfLoRA | 75.15 | 4.19 | 75.79 | 3.47 |
-| **GainLoRA (InfLoRA)** | **78.01** | **0.77** | **77.54** | **1.20** |
-| **OT-SIGN+GainLoRA (ours)** | | | | |
----
-## Table 3: Llama — SuperNI Benchmark (from GainLoRA paper)
-### Llama-2-7B
-| Method | Order 1 AP↑ | Order 1 FT↓ | Order 2 AP↑ | Order 2 FT↓ |
-|--------|-------------|-------------|-------------|-------------|
-| O-LoRA | 39.37 | 15.84 | 37.55 | 20.23 |
-| GainLoRA (O-LoRA) | 51.10 | 4.96 | 51.14 | 5.57 |
-| InfLoRA | 42.93 | 11.23 | 39.94 | 15.00 |
-| **GainLoRA (InfLoRA)** | **51.27** | **2.84** | **50.17** | **4.71** |
-| **SpecRoute (ours)** | | | | |
-### Llama-2-13B
-| Method | Order 1 AP↑ | Order 1 FT↓ | Order 2 AP↑ | Order 2 FT↓ |
-|--------|-------------|-------------|-------------|-------------|
-| O-LoRA | 43.92 | 14.15 | 40.05 | 19.53 |
-| GainLoRA (O-LoRA) | 52.47 | 4.78 | 51.68 | 5.86 |
-| InfLoRA | 43.64 | 14.85 | 45.74 | 10.61 |
-| **GainLoRA (InfLoRA)** | **53.64** | **2.87** | **52.46** | **4.90** |
-| **SpecRoute (ours)** | | | | |
-### Llama-3-8B
-| Method | Order 1 AP↑ | Order 1 FT↓ | Order 2 AP↑ | Order 2 FT↓ |
-|--------|-------------|-------------|-------------|-------------|
-| O-LoRA | 42.49 | 8.85 | 38.67 | 19.28 |
-| GainLoRA (O-LoRA) | 53.39 | 3.56 | 51.69 | 6.20 |
-| InfLoRA | 43.27 | 6.02 | 48.77 | 5.88 |
-| **GainLoRA (InfLoRA)** | **52.18** | **1.40** | **52.48** | **4.21** |
-| **SpecRoute (ours)** | | | | |
----
-## Table 4: Ablation Study — GainLoRA with T5-Large & Llama-2-7B (from paper)
-| Method | T5-Large O1 AP↑ | T5-Large O1 FT↓ | T5-Large O2 AP↑ | T5-Large O2 FT↓ | Llama-2-7B O1 AP↑ | Llama-2-7B O1 FT↓ | Llama-2-7B O2 AP↑ | Llama-2-7B O2 FT↓ |
-|--------|---|---|---|---|---|---|---|---|
-| GainLoRA (O-LoRA) | 47.84 | 2.26 | 46.84 | 2.91 | 51.10 | 4.96 | 51.14 | 5.57 |
-| No Init Constraints | 35.30 | 17.19 | 39.82 | 12.90 | 44.02 | 11.71 | 42.89 | 14.77 |
-| No Update Constraints | 23.01 | 30.32 | 24.96 | 28.14 | 33.74 | 23.06 | 34.71 | 22.36 |
-| No Constraints | 26.32 | 26.00 | 30.63 | 22.37 | 34.48 | 23.46 | 36.87 | 21.24 |
-| GainLoRA (InfLoRA) | 46.21 | 2.40 | 46.44 | 2.61 | 51.27 | 2.84 | 50.17 | 4.71 |
-| No Init Constraints | 45.38 | 3.40 | 43.05 | 5.15 | 50.48 | 3.48 | 48.17 | 6.45 |
-| No Update Constraints | 37.69 | 10.94 | 38.85 | 9.31 | 48.52 | 5.68 | 47.85 | 7.00 |
-| No Constraints | 36.75 | 12.18 | 41.00 | 6.66 | 49.10 | 6.07 | 45.77 | 8.70 |
-> **Note**: "Init Constraints" = LoRA_A null-space projection (GPM), "Update Constraints" = GainLoRA gating + prompt_key routing
----
-## Per-Task Breakdown — Order 1 (fill after running)
-Run the following command to get the numbers to fill in:
-```bash
-python src/compute_ap_ft.py \
-  --output_base logs_and_outputs/ot_sign_order1_t5large/outputs \
-  --task_order "task1572_samsum_summary,task363_sst2_polarity_classification,task1290_xsum_summarization,task181_outcome_extraction,task002_quoref_answer_generation,task1510_evalution_relation_extraction,task639_multi_woz_user_utterance_generation,task1729_personachat_generate_next,task073_commonsenseqa_answer_generation,task1590_diplomacy_text_generation,task748_glucose_reverse_cause_event_detection,task511_reddit_tifu_long_text_summarization,task591_sciq_answer_generation,task1687_sentiment140_classification,task875_emotion_classification" \
-  --method_name "OT-SIGN+GainLoRA Order1"
-```
-| # | Task | GainLoRA Peak | GainLoRA Final | OT-SIGN Peak | OT-SIGN Final |
-|---|------|--------------|---------------|-------------|--------------|
-| 1 | task1572_samsum_summary | | | | |
-| 2 | task363_sst2_polarity_classification | | | | |
-| 3 | task1290_xsum_summarization | | | | |
-| 4 | task181_outcome_extraction | | | | |
-| 5 | task002_quoref_answer_generation | | | | |
-| 6 | task1510_evalution_relation_extraction | | | | |
-| 7 | task639_multi_woz_user_utterance_generation | | | | |
-| 8 | task1729_personachat_generate_next | | | | |
-| 9 | task073_commonsenseqa_answer_generation | | | | |
-| 10 | task1590_diplomacy_text_generation | | | | |
-| 11 | task748_glucose_reverse_cause_event_detection | | | | |
-| 12 | task511_reddit_tifu_long_text_summarization | | | | |
-| 13 | task591_sciq_answer_generation | | | | |
-| 14 | task1687_sentiment140_classification | | | | |
-| 15 | task875_emotion_classification | | | | |
-| | **AP / FT** | **46.21 / 2.40** | | | |
-## Per-Task Breakdown — Order 2
-```bash
-python src/compute_ap_ft.py \
-  --output_base logs_and_outputs/ot_sign_order2_t5large/outputs \
-  --task_order "task748_glucose_reverse_cause_event_detection,task073_commonsenseqa_answer_generation,task1590_diplomacy_text_generation,task639_multi_woz_user_utterance_generation,task1572_samsum_summary,task1687_sentiment140_classification,task591_sciq_answer_generation,task363_sst2_polarity_classification,task1510_evalution_relation_extraction,task1729_personachat_generate_next,task181_outcome_extraction,task511_reddit_tifu_long_text_summarization,task002_quoref_answer_generation,task1290_xsum_summarization,task875_emotion_classification" \
-  --method_name "OT-SIGN+GainLoRA Order2"
-```
-| # | Task | GainLoRA Peak | GainLoRA Final | OT-SIGN Peak | OT-SIGN Final |
-|---|------|--------------|---------------|-------------|--------------|
-| 1 | task748_glucose_reverse_cause_event_detection | | | | |
-| 2 | task073_commonsenseqa_answer_generation | | | | |
-| 3 | task1590_diplomacy_text_generation | | | | |
-| 4 | task639_multi_woz_user_utterance_generation | | | | |
-| 5 | task1572_samsum_summary | | | | |
-| 6 | task1687_sentiment140_classification | | | | |
-| 7 | task591_sciq_answer_generation | | | | |
-| 8 | task363_sst2_polarity_classification | | | | |
-| 9 | task1510_evalution_relation_extraction | | | | |
-| 10 | task1729_personachat_generate_next | | | | |
-| 11 | task181_outcome_extraction | | | | |
-| 12 | task511_reddit_tifu_long_text_summarization | | | | |
-| 13 | task002_quoref_answer_generation | | | | |
-| 14 | task1290_xsum_summarization | | | | |
-| 15 | task875_emotion_classification | | | | |
-| | **AP / FT** | **46.44 / 2.61** | | | |
-## Per-Task Breakdown — Order 3 (Long)
-```bash
-python src/compute_ap_ft.py \
-  --output_base logs_and_outputs/ot_sign_order3_t5large/outputs \
-  --task_order "yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic" \
-  --method_name "OT-SIGN+GainLoRA Order3"
-```
-| # | Task | GainLoRA Peak | GainLoRA Final | OT-SIGN Peak | OT-SIGN Final |
-|---|------|--------------|---------------|-------------|--------------|
-| 1 | yelp | | | | |
-| 2 | amazon | | | | |
-| 3 | mnli | | | | |
-| 4 | cb | | | | |
-| 5 | copa | | | | |
-| 6 | qqp | | | | |
-| 7 | rte | | | | |
-| 8 | imdb | | | | |
-| 9 | sst2 | | | | |
-| 10 | dbpedia | | | | |
-| 11 | agnews | | | | |
-| 12 | yahoo | | | | |
-| 13 | multirc | | | | |
-| 14 | boolq | | | | |
-| 15 | wic | | | | |
-| | **AP / FT** | **78.01 / 0.77** | | | |
-## Per-Task Breakdown — Order 4 (Long)
-```bash
-python src/compute_ap_ft.py \
-  --output_base logs_and_outputs/ot_sign_order4_t5large/outputs \
-  --task_order "mnli,cb,wic,copa,qqp,boolq,rte,imdb,yelp,amazon,sst2,dbpedia,agnews,multirc,yahoo" \
-  --method_name "OT-SIGN+GainLoRA Order4"
-```
-| # | Task | GainLoRA Peak | GainLoRA Final | OT-SIGN Peak | OT-SIGN Final |
-|---|------|--------------|---------------|-------------|--------------|
-| 1 | mnli | | | | |
-| 2 | cb | | | | |
-| 3 | wic | | | | |
-| 4 | copa | | | | |
-| 5 | qqp | | | | |
-| 6 | boolq | | | | |
-| 7 | rte | | | | |
-| 8 | imdb | | | | |
-| 9 | yelp | | | | |
-| 10 | amazon | | | | |
-| 11 | sst2 | | | | |
-| 12 | dbpedia | | | | |
-| 13 | agnews | | | | |
-| 14 | multirc | | | | |
-| 15 | yahoo | | | | |
-```bash
-# Run these four commands to collect the numbers for both tables:
-python src/compute_ap_ft.py --output_base logs_and_outputs/ot_sign_order1_t5large/outputs --task_order "task1572_samsum_summary,task363_sst2_polarity_classification,task1290_xsum_summarization,task181_outcome_extraction,task002_quoref_answer_generation,task1510_evalution_relation_extraction,task639_multi_woz_user_utterance_generation,task1729_personachat_generate_next,task073_commonsenseqa_answer_generation,task1590_diplomacy_text_generation,task748_glucose_reverse_cause_event_detection,task511_reddit_tifu_long_text_summarization,task591_sciq_answer_generation,task1687_sentiment140_classification,task875_emotion_classification" --save
-python src/compute_ap_ft.py --output_base logs_and_outputs/ot_sign_order2_t5large/outputs --task_order "task748_glucose_reverse_cause_event_detection,task073_commonsenseqa_answer_generation,task1590_diplomacy_text_generation,task639_multi_woz_user_utterance_generation,task1572_samsum_summary,task1687_sentiment140_classification,task591_sciq_answer_generation,task363_sst2_polarity_classification,task1510_evalution_relation_extraction,task1729_personachat_generate_next,task181_outcome_extraction,task511_reddit_tifu_long_text_summarization,task002_quoref_answer_generation,task1290_xsum_summarization,task875_emotion_classification" --save
-python src/compute_ap_ft.py --output_base logs_and_outputs/ot_sign_order3_t5large/outputs --task_order "yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic" --save
-python src/compute_ap_ft.py --output_base logs_and_outputs/ot_sign_order4_t5large/outputs --task_order "mnli,cb,wic,copa,qqp,boolq,rte,imdb,yelp,amazon,sst2,dbpedia,agnews,multirc,yahoo" --save
-```
----
-## Table 5: T5-Small — Long Benchmark (Order 3)
-| Method | Order 3 AP↑ | Order 3 FT↓ |
-|--------|-------------|-------------|
-| **GainLoRA (Root)** | **59.70** | N/A* |
-| **SpecRoute (Improve)** | 39.74† | N/A* |
-> *\*FT = N/A: both logged runs were missing `--do_predict`. The next run, using the fixed `T5_small/` script, will produce FT.*
-> *†Improve scores are computed from each task's `predict_eval_predictions.jsonl` (the diagonal of the score matrix). imdb/sst2/wic drop to 0 due to catastrophic forgetting.*
-### ⚠️ Root GainLoRA outperforms SpecRoute on T5-Small (−19.96 AP)
-SpecRoute collapses on several classification tasks (imdb=0.21, sst2=0.00, yahoo=8.12, wic=0.00), which is attributed here to severe catastrophic forgetting. A possible cause is that the SVD rank is too small on T5-Small, so the routing mechanism cannot separate the task subspaces.
-## Per-Task Breakdown — Order 3 (T5-Small)
-| # | Task | GainLoRA (Root) | SpecRoute (Improve) | Δ (Improve−Root) |
-|---|------|-----------------|--------------------|-----------------|
-| 1 | yelp | 56.01 | 54.36 | −1.65 |
-| 2 | amazon | 52.05 | 50.01 | −2.04 |
-| 3 | mnli | 34.07 | 35.50 | +1.43 |
-| 4 | cb | 3.57 | 0.00 | −3.57 |
-| 5 | copa | 42.00 | 44.00 | +2.00 |
-| 6 | qqp | 76.96 | 76.72 | −0.24 |
-| 7 | rte | 45.85 | 50.90 | +5.05 |
-| 8 | imdb | 89.51 | 0.21 | **−89.30 ⚠️** |
-| 9 | sst2 | 85.21 | 0.00 | **−85.21 ⚠️** |
-| 10 | dbpedia | 98.16 | 92.22 | −5.94 |
-| 11 | agnews | 88.37 | 68.76 | −19.61 |
-| 12 | yahoo | 57.28 | 8.12 | **−49.16 ⚠️** |
-| 13 | multirc | 50.52 | 54.23 | +3.71 |
-| 14 | boolq | 60.43 | 61.13 | +0.70 |
-| 15 | wic | 55.49 | 0.00 | **−55.49 ⚠️** |
-| | **AP / FT** | **59.70 / N/A** | **39.74 / N/A** | **−19.96** |

results/contribution2_implementation_analysis.md DELETED

@@ -1,176 +0,0 @@
-# Contribution 2 (C4): Spectrally-Conditioned LoRA Training — Implementation Analysis
-## 1. Summary
-C4 addresses **single-task LoRA quality** — the second pillar of continual learning performance (alongside catastrophic forgetting). Even with perfect routing (C1/C3 SpecRoute) and null-space constraints (InfLoRA), each task's LoRA adapter can underperform because:
-- **Gradient distortion**: B gradients are distorted by frozen A's non-orthogonal column space
-- **Low effective rank**: CE loss alone doesn't encourage full utilization of LoRA's rank budget
-- **Suboptimal A initialization**: InfLoRA projects A into null-space, but the resulting A may have poor spectral conditioning
-C4 proposes two complementary fixes:
-1. **Preconditioned gradient**: $(AA^T + \epsilon I)^{-1/2} \nabla_B$ corrects gradient distortion from frozen A
-2. **Spectral entropy regularization**: Maximizes effective rank of $BA$ to fully utilize the rank budget
-## 2. Preconditioned Gradient — Mathematical Foundation
-### Problem: Gradient Distortion
-In standard LoRA, the update $\Delta W = BA$ where A is frozen (InfLoRA constraint). The gradient of loss w.r.t. B is:
-$$\nabla_B \mathcal{L} = \nabla_{\Delta W} \mathcal{L} \cdot A^T$$
-When A's columns are non-orthogonal (typical after null-space projection), $A^T$ distorts the gradient direction. Directions aligned with A's dominant singular vectors get amplified, while directions aligned with small singular vectors get suppressed.
-### Solution: Spectral Preconditioning
-Apply $(AA^T + \epsilon I)^{-1/2}$ to B's gradient after backward:
-$$\tilde{\nabla}_B = \nabla_B \mathcal{L} \cdot (AA^T + \epsilon I)^{-1/2}$$
-This equalizes gradient magnitudes across all directions in A's column space, allowing B to learn uniformly across all rank dimensions.
-### Implementation
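-```python
-import torch
-
-def precompute_preconditioners(self, eps=1e-6):
-    """Cache (A^T A + eps*I)^{-1/2} per LoRA pair; A is frozen, so this runs once per task."""
-    self.preconditioners = {}
-    for module in self.lora_modules:  # assumed: iterable of layers exposing lora_q / lora_v
-        for lora in [module.lora_q, module.lora_v]:
-            A = lora.lora_A.data.float()           # stored as [d_in, r]
-            gram = A.T @ A                          # [r, r]; this is AA^T in the r x d_in notation above
-            gram = gram + eps * torch.eye(gram.shape[0], device=gram.device)
-            eigvals, eigvecs = torch.linalg.eigh(gram)
-            inv_sqrt = eigvecs @ torch.diag(eigvals.rsqrt()) @ eigvecs.T
-            self.preconditioners[lora.lora_B] = inv_sqrt  # looked up when B's gradient is rescaled
-```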
-**Key property**: Computed ONCE after `get_reg_matrix()` projects A into null-space. Since A is frozen during training, the preconditioner is constant — no per-step overhead.
-**Compatibility with GPM**: GPM projects A into null-space ONCE before training starts. Preconditioning operates on B's gradients AFTER backward. These are completely independent operations on different parameters at different times.
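As a rough numerical illustration of the preconditioner described in this section, the NumPy sketch below builds a small `A` with a known condition number and measures how $(AA^T + \epsilon I)^{-1/2}$ changes the conditioning of the $r \times r$ Gram matrix. The sizes, singular values, and variable names are assumptions chosen for the toy example, and NumPy stands in for the torch tensors of the real trainer.

```python
import numpy as np

rng = np.random.default_rng(0)
r, d = 4, 32
sing = np.array([4.0, 2.0, 1.0, 0.5])          # chosen so that kappa(A) = 8

# Build A = U diag(sing) V^T with orthonormal U (r x r) and V (d x r).
U, _ = np.linalg.qr(rng.standard_normal((r, r)))
V, _ = np.linalg.qr(rng.standard_normal((d, r)))
A = U @ np.diag(sing) @ V.T                     # [r, d], math convention

eps = 1e-12
gram = A @ A.T + eps * np.eye(r)                # AA^T + eps*I, [r, r]
eigvals, eigvecs = np.linalg.eigh(gram)
inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

cond_plain = np.linalg.cond(gram)               # curvature seen by plain gradient descent on B
cond_precond = np.linalg.cond(gram @ inv_sqrt)  # curvature after right-multiplying by the preconditioner
print(cond_plain, cond_precond)                 # roughly 64 and 8 for this A
```

On this toy problem the inverse square root brings the conditioning from about κ² down to about κ; whether that translates into faster training on real tasks is an empirical question for the ablation plan.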
-## 3. Spectral Entropy Regularization — Mathematical Foundation
-### Problem: Low Effective Rank
-CE loss optimizes for task accuracy but doesn't care about the spectral structure of $BA$. In practice, $BA$ often has very low effective rank — most of the "learning budget" (rank r) is wasted on near-zero singular values.
-### Solution: Maximize Spectral Entropy
-Define the normalized singular values of $BA$:
-$$\hat{\sigma}_i = \frac{\sigma_i(BA)}{\sum_j \sigma_j(BA)}$$
-The spectral entropy is:
-$$H = -\sum_i \hat{\sigma}_i \log \hat{\sigma}_i$$
-Maximum entropy $H_{max} = \log(r)$ occurs when all singular values are equal (full rank utilization).
-### Regularization Loss
-$$\mathcal{L}_{total} = \mathcal{L}_{CE} + \lambda \sum_{\ell} (H_{max} - H_\ell)$$
-where the sum is over all LoRA layers $\ell$.
-### Efficient QR Trick
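-Computing SVD of the full $BA$ matrix ($d_{out} \times d_{in}$) is expensive. Instead, with $B$ and $A$ in the code's storage layout ($B$ as $r \times d_{out}$, $A$ as $d_{in} \times r$):
-1. $Q_B, R_B = QR(B^T)$ → $R_B$ is $r \times r$
-2. $Q_A, R_A = QR(A)$ → $R_A$ is $r \times r$
-3. $\hat{\sigma} = \text{svdvals}(R_B \cdot R_A^T)$ → SVD of $r \times r$ matrix
-This gives the same nonzero singular values as $BA$. The two thin QR factorizations cost $O((d_{in} + d_{out}) r^2)$ and the small SVD costs $O(r^3)$, compared with at least $O(d_{out} \cdot d_{in} \cdot r)$ just to form the full matrix.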
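The QR trick in the three steps above can be checked numerically. The sketch below is a NumPy stand-in under assumed toy sizes, using the storage layout of the implementation in this file (`A` as `[d_in, r]`, `B` as `[r, d_out]`), so the full update is `A @ B`, the transpose of the $BA$ in the formulas, which has the same singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
r, d_in, d_out = 4, 48, 40                     # assumed toy sizes
A = rng.standard_normal((d_in, r))             # storage layout [d_in, r]
B = rng.standard_normal((r, d_out))            # storage layout [r, d_out]

# Steps 1-3: two thin QR factorizations, then an r x r SVD.
_, R_B = np.linalg.qr(B.T)                     # R_B: [r, r]
_, R_A = np.linalg.qr(A)                       # R_A: [r, r]
sigma_small = np.linalg.svd(R_B @ R_A.T, compute_uv=False)

# Reference: singular values of the full low-rank update.
sigma_full = np.linalg.svd(A @ B, compute_uv=False)[:r]
print(np.allclose(sigma_small, sigma_full))    # True

# Spectral entropy and its gap to the maximum log(r).
sigma_hat = sigma_small / sigma_small.sum()
entropy = -(sigma_hat * np.log(sigma_hat)).sum()
gap = np.log(r) - entropy                      # the per-layer regularization term
print(entropy, gap)
```

The `gap` value is the quantity the regularizer drives toward zero; it is zero only when all `r` singular values are equal.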
-### Implementation
-```python
-import math
-import torch
-
-def _compute_spectral_entropy_loss(self, eps=1e-8):
-    loss, count = 0.0, 0
-    for module in self.lora_modules:  # assumed: iterable of layers exposing lora_q / lora_v
-        for lora in [module.lora_q, module.lora_v]:
-            B = lora.lora_B.float()    # [r, d_out]
-            A = lora.lora_A.float()    # [d_in, r]
-            _, R_B = torch.linalg.qr(B.T)      # R_B: [r, r]
-            _, R_A = torch.linalg.qr(A)         # R_A: [r, r]
-            sigma_hat = torch.linalg.svdvals(R_B @ R_A.T)  # [r]
-            sigma_hat = sigma_hat / (sigma_hat.sum() + eps)
-            ent = -(sigma_hat * torch.log(sigma_hat + eps)).sum()
-            loss = loss + (math.log(sigma_hat.numel()) - ent)  # H_max - H for this layer
-            count += 1
-    return loss / max(count, 1)
-```
-## 4. Pipeline Integration
-### Training Pipeline (per task):
-```
-1. Load model + previous LoRA weights
-2. get_reg_matrix()           ← InfLoRA: project A into null-space
-3. precompute_preconditioners() ← C4: compute (AA^T+εI)^{-1/2}
-4. Training loop:
-   a. Forward pass → CE loss
-   b. If step >= warmup: compute spectral entropy loss
-   c. total_loss = CE + λ * entropy_loss
-   d. backward(total_loss)
-   e. _apply_preconditioning()   ← C4: modify B gradients
-   f. optimizer.step()
-5. get_representation()       ← GPM: update subspace bases
-6. Save model + GPM bases
-```
-### Key Integration Points:
-- `precompute_preconditioners()` after `get_reg_matrix()` (A is frozen, preconditioner is constant)
-- Entropy loss added BEFORE backward (part of computational graph)
-- Preconditioning applied AFTER backward (direct gradient modification)
-- Warmup ratio prevents entropy regularization from dominating early training
-## 5. Synergy with Existing Contributions
-| Component | C1 (Spectral Routing) | C3 (Inference Routing) | C4 (LoRA Quality) |
-|-----------|----------------------|----------------------|-------------------|
-| Target | Task selection | Inference accuracy | Single-task quality |
-| Mechanism | SVD signatures | Symmetric routing | Precond + entropy |
-| When | Forward pass | Inference time | Training time |
-| Interacts with | C4 (better LoRA → better signatures) | C1 (routing probabilities) | C1 (better training → better routing) |
-**Virtuous cycle**: C4 improves each LoRA's quality → spectral signatures become more distinctive → C1 routing becomes more accurate → less interference → better continual learning.
-## 6. Hyperparameters
-| Parameter | Default | Range | Role |
-|-----------|---------|-------|------|
-| `lambda_entropy` | 0.01 | [0.001, 0.1] | Weight of spectral entropy loss |
-| `use_preconditioning` | True | {True, False} | Enable gradient preconditioning |
-| `precond_eps` | 1e-6 | [1e-8, 1e-4] | Numerical stability for preconditioner |
-| `entropy_warmup_ratio` | 0.1 | [0.0, 0.3] | Fraction of steps before enabling entropy |
-## 7. Ablation Plan
-| Experiment | Precond | Entropy | Purpose |
-|------------|---------|---------|---------|
-| V3 (baseline) | ✗ | ✗ | Current best |
-| V4a | ✓ | ✗ | Isolate preconditioning effect |
-| V4b | ✗ | ✓ | Isolate entropy effect |
-| V4 (full) | ✓ | ✓ | Full C4 |
-| V4-λ sweep | ✓ | ✓ (λ ∈ {0.001, 0.01, 0.1}) | Sensitivity analysis |
-## 8. Risk Assessment
-### Low Risk:
-- Preconditioning is a well-established technique (natural gradient, K-FAC)
-- Spectral entropy is a differentiable, smooth regularizer
-- Both components are additive — easy to disable if harmful
-### Medium Risk:
-- Entropy regularization may conflict with task-specific spectral structure (some tasks may genuinely need low-rank updates)
-- Preconditioner may be ill-conditioned if A has very small singular values (mitigated by ε)
-### Mitigation:
-- Warmup ratio delays entropy regularization until CE loss stabilizes
-- ε in preconditioner prevents numerical issues
-- Ablation plan isolates each component's effect
-## 9. Theoretical Guarantees
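-### Preconditioned Gradient:
-- For a quadratic loss in $\Delta W$, the curvature with respect to B scales with $AA^T$, so if A has condition number κ, plain gradient descent on B needs on the order of κ² iterations
-- The inverse-square-root preconditioner $(AA^T + \epsilon I)^{-1/2}$ reduces the effective condition number from κ² to about κ; only the full inverse $(AA^T)^{-1}$ would make convergence condition-number-independent
-- This is related to, but not the same as, natural gradient descent in the B parameter space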
-### Spectral Entropy:
-- Maximum entropy ⟺ all singular values equal ⟺ effective rank = r
-- This maximizes the "information capacity" of the LoRA adapter
-- Connected to matrix information theory: max-entropy distribution on singular values
-## 10. Code Changes Summary
-### Files Modified:
-1. **`cl_trainer_specroute.py`**: +4 init params, +3 methods (`precompute_preconditioners`, `_compute_spectral_entropy_loss`, `_apply_preconditioning`), modified `training_step`
-2. **`run_t5.py`**: +4 dataclass fields, updated SpecRoute_Trainer constructor, added `precompute_preconditioners()` call
-3. **`gen_script_long_order3_t5_small_specroute_v4.sh`**: New V4 experiment script with C4 args
-### Lines of Code: ~80 lines of new logic (excluding comments/docstrings)
-### Dependencies: No new dependencies (uses only torch.linalg built-ins)

results/deep_theoretical_analysis_svd_lora_routing.md ADDED

	@@ -0,0 +1,545 @@

+# Deep Theoretical Analysis: SVD, Frozen A, and Routing Methodology
+## ⚠️ REVISION NOTE (v2)
+**The previous version (v1) made a serious error**: it recommended keeping prototype routing (V5), but a prototype is a mean embedding, i.e. a **data statistic**, which **VIOLATES zero-replay**. The whole analysis has been revised.
+**Critique reference**: `v6_discuss.md`, which correctly argues that prototypes violate zero-replay.
+## Preamble
+This analysis strictly adheres to:
+- **settings.txt**: zero-replay (no use of old data in any form, including statistics, distributions, and mean embeddings)
+- **research_rule.txt**: theory → weakness → motivation → improvement → experiment
+- **work_method**: theory-first, no trial and error
+It references ≥30 papers on mathematics, information theory, matrices, and LoRA.
+---
+## I. ROOT CAUSES OF THE "NEVER-LEARNING" PHENOMENON
+### 1.1 The question
+CB (EM=0.00), MNLI (34.07), and several other tasks have lower single-task quality than ROOT. Contribution C2 (routing gate) has no bearing on the substance of single-task training, because routing only decides **who handles the input**, not **how well it is handled**.
+### 1.2 Structural analysis: frozen A + GPM → Compound Constraint
+Consider task $t$ in the sequence. The InfLoRA constraint is:
+$$\text{row}(A_t) \subseteq \text{null}\left(\sum_{j<t} A_j^\top A_j\right)$$
+That is, with $A \in \mathbb{R}^{r \times d}$, the rows of $A_t$ must lie in the null-space of the projection matrix $P_{\text{old}} = \sum_{j<t} V_j V_j^\top$ (where $V_j$ are the GPM bases of task $j$).
+**Dimension consequence**: the null-space dimension shrinks as tasks accumulate:
+$$\dim(\text{null}) \leq d - \sum_{j<t} r_j \cdot \gamma_j$$
+where $\gamma_j$ is the fraction of capacity consumed (controlled by the ESA threshold).
+With $d=512$, $r=8$, threshold=0.995 → each task consumes ~$r_{\text{eff}}$ dimensions. After 14 tasks, the null-space has ~$512 - 14 \times r_{\text{eff}}$ dimensions left.
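To make the dimension budget above concrete, here is a small arithmetic sketch. The helper name `remaining_null_dim` and the three `r_eff` values are hypothetical; the actual per-task consumption depends on the GPM threshold and on the data, and is not stated in this analysis.

```python
def remaining_null_dim(d, num_prev_tasks, r_eff):
    """Upper bound on the null-space dimension left after `num_prev_tasks` tasks,
    assuming each earlier task consumes `r_eff` dimensions (an assumption)."""
    return max(d - num_prev_tasks * r_eff, 0)


d = 512
for r_eff in (8, 20, 36):                      # hypothetical per-task consumption
    left = remaining_null_dim(d, 14, r_eff)
    print(f"r_eff={r_eff:2d} -> {left} dimensions left for task 15")
# r_eff= 8 -> 400 dimensions left for task 15
# r_eff=20 -> 232 dimensions left for task 15
# r_eff=36 -> 8 dimensions left for task 15
```

The spread between these cases shows why the effective consumption per task, rather than the nominal rank alone, determines how constrained late tasks are.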
+**Paper references**:
+- **InfLoRA** (Sun et al., CVPR 2024): Proves null-space initialization prevents retroactive interference BUT doesn't guarantee learning quality
+- **GORP** (ACL 2025): Unified gradient subspace projection — shows projection shrinks expressiveness
+- **CMCL** (NeurIPS 2025): Dual stability/plasticity bounds — tradeoff is fundamental
+### 1.3 Three Root Causes of Never-Learning
+#### Cause 1: Gradient Distortion (mathematical level)
+With A frozen, the gradient of B is:
+$$\nabla_B \mathcal{L} = \nabla_{\Delta W} \mathcal{L} \cdot A^\top$$
+The SVD of $A$ is $A = U_A \Sigma_A V_A^\top$ (with $A \in \mathbb{R}^{r \times d}$).
+The gradient is **right-multiplied** by $A^\top$. If $\kappa(A) = \sigma_1(A)/\sigma_r(A) \gg 1$:
+- Directions associated with $\sigma_{\min}(A)$ receive **very small** gradients → slow convergence
+- Adam normalizes per parameter but does NOT normalize per direction in the space $\mathbb{R}^r$
+- The effective learning rate becomes anisotropic, governed by the condition number
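The anisotropy described in the bullets above can be seen directly on a toy example. In the NumPy sketch below (sizes, singular values, and the random stand-in `G` for $\nabla_{\Delta W} \mathcal{L}$ are all assumptions), the gradient of B is expressed in the left-singular basis of A, where each direction turns out to be scaled by the corresponding singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
r, d, d_out = 4, 32, 16
sing = np.array([4.0, 2.0, 1.0, 0.5])          # assumed singular values of A

U_A, _ = np.linalg.qr(rng.standard_normal((r, r)))
V_A, _ = np.linalg.qr(rng.standard_normal((d, r)))
A = U_A @ np.diag(sing) @ V_A.T                # [r, d]

G = rng.standard_normal((d_out, d))            # stand-in for the gradient w.r.t. Delta W
grad_B = G @ A.T                               # [d_out, r]

# In the basis U_A, grad_B equals (G V_A) diag(sing): column i is scaled by sigma_i.
coords = grad_B @ U_A
per_direction = np.linalg.norm(coords, axis=0)
unscaled = np.linalg.norm(G @ V_A, axis=0)
ratio = per_direction / unscaled
print(ratio)                                   # matches `sing` up to rounding
```

The direction tied to the smallest singular value receives a gradient eight times smaller than the direction tied to the largest one in this toy setup, independent of what the task signal `G` looks like.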
+**Paper references**:
+- **Muon/Riemannion** (Jordan et al., 2025): Proposes Riemannian LoRA on the fixed-rank manifold, shows the Euclidean gradient is suboptimal for rank-constrained optimization
+- **LORO** (ICLR 2025): Steepest descent on manifold $\mathcal{M}_r$, addresses anisotropy
+- **SD-LoRA** (ICLR 2025): Direction/magnitude decomposition — decouples learning paths
+- **LoKO** (2024): Kalman filter optimizer for LoRA, shows the conditioning issue is real
+**Quantitative**: After GPM projection + normalization ($A \leftarrow A / (\sqrt{3} \|A\|)$), the singular values of $A$ are approximately uniform (a result of random init + orthogonal projection). However, the **effective** condition number depends on the alignment between the random directions of A and the task-relevant directions.
+#### Cause 2: Random A ≠ Optimal A (Information-Theoretic)
+$A_t$ is initialized with Kaiming uniform and then projected into the null-space. From an information-theoretic viewpoint:
+**Mutual Information Bound** (Data Processing Inequality):
+$$I(h; z_t) = I(h; A_t h) \leq H(z_t) \leq \frac{r}{2} \log\left(1 + \frac{\text{Var}[A_t h]}{r}\right)$$
+A random $A_t$ does **not maximize** $I(h; z_t)$ for the task-specific distribution of $h$. An optimal $A_t$ should align with the top-$r$ principal components of the task data covariance WITHIN the null-space.
+**In plain terms**: a random A picks $r$ random directions from the null-space ($d'$ dimensions). The probability of picking the "best" $r$ directions is ≈ 0 when $d' \gg r$.
+**Paper references**:
+- **PLAN** (ICCV 2025): Proactive rank allocation via perturbation sensitivity — shows different layers need different ranks. Random allocation wastes capacity.
+- **TreeLoRA** (ICML 2025): Gradient similarity tree for layer-wise allocation
+- **Information Bottleneck Theory** (Tishby et al., 2000): LoRA = bottleneck with capacity $C = \sum \sigma_i^2$
+- **Angle Matters** (ICML 2025): Angle between task signals determines forgetting rate AND learning rate
+**Concrete example**: CB has 250 samples and 3 classes (entailment/contradiction/neutral). It needs A directions aligned with the semantic distinctions among the 3 labels. A random A from the null-space (dimension ~390 at task 4) picks only 8 random directions → most are **unrelated** to the task signal.
+#### Cause 3: CE Loss Alone + Insufficient Training
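+Cross-entropy only optimizes prediction accuracy. No mechanism guarantees:
+- A high **effective rank** of $\Delta W = BA$ → spectral health
+- Even **utilization** of the rank directions
+- **Regularization** against overfitting on tiny datasets
+CB: 250 samples × 10 epochs = 2,500 sample presentations. At batch size 8 that is roughly 310 optimizer steps; the count drops to ~80 steps only if the effective batch size is about 32 (batch=8 combined with multiple GPUs or gradient accumulation). The Adam optimizer needs on the order of hundreds of steps to converge on an NLI task.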
+**Paper references**:
+- **Stiefel-LoRA** (EMNLP 2025): $B^\top B = I_r$ constraint maximizes effective rank
+- **SD-LoRA** (ICLR 2025): Explicit rank regularization through directional alignment
+- **SEFE** (ICML 2025): Distinguishes superficial vs essential forgetting — CE loss conflates them
+- **LoRA–** (CVPR 2025): Triplet loss in drift-resistant space for better single-task quality
+### 1.4 Conclusion: never-learning is NOT caused by routing
+Routing (C2) only decides the input → expert mapping. Never-learning comes from:
+1. **Frozen A + GPM projection** → restricted learning capacity (structural)
+2. **Random A** in the null-space → poor alignment with the task signal (information-theoretic)
+3. **Insufficient training** for tiny datasets (practical)
+**Important**: ROOT uses the SAME InfLoRA (frozen A + GPM), yet ROOT CB = 3.57 (also near-failure). This is a **fundamental limitation** of the InfLoRA approach, NOT something specific to SpecRoute.
+---
+## II. SVD AS A SIGNATURE: DOES IT NEED RECONSIDERING?
+### 2.1 What the SVD represents
+The SVD factorizes $\Delta W = BA = U \Sigma V^\top$:
+- $V$ (right singular vectors): the **input directions** the expert "listens to"
+- $U$ (left singular vectors): the **output directions** the expert "emits"
+- $\Sigma$ (singular values): the **strength** of the modification
+The current spectral signature uses $(V, \sigma)$: input receptive field plus importance.
+### 2.2 How does a frozen A affect the SVD?
+This is the key question. Consider:
+$$\Delta W_t = B_t A_t$$
+With $A_t$ frozen (Kaiming init + GPM projection), $\text{rank}(\Delta W_t) \leq r$.
+**SVD structural constraint**:
+$$\Delta W_t = B_t A_t = (U_B \Sigma_B V_B^\top)(U_A \Sigma_A V_A^\top)$$
+$$= U_B \Sigma_B (V_B^\top U_A) \Sigma_A V_A^\top$$
+Let $M = \Sigma_B (V_B^\top U_A) \Sigma_A \in \mathbb{R}^{r \times r}$ with SVD $M = P S Q^\top$. Then:
+$$\Delta W_t = (U_B P) S (Q^\top V_A^\top) = \tilde{U} S \tilde{V}^\top$$
+**Key insight**: the right singular vectors of $\Delta W_t$ are:
+$$\tilde{V} = V_A Q$$
+That is, **the right singular vectors always lie in the row-space of A**:
+$$\text{col}(\tilde{V}) = \text{col}(V_A Q) \subseteq \text{col}(V_A) = \text{row}(A_t)$$
+**Far-reaching consequence**: the SVD spectral signature is **BOUNDED** by $\text{row}(A_t)$!
+**Paper references**:
+- **Eckart-Young Theorem** (1936): Best rank-$k$ approximation = truncated SVD
+- **Weyl's Perturbation Theorem**: $|\sigma_i(A+E) - \sigma_i(A)| \leq \|E\|_2$
+- **Horn & Johnson** (Matrix Analysis, 2013): SVD of product $BA$ — structure thm
+- **Interlacing Inequalities** (Thompson, 1976): Singular values of products
+### 2.3 Vấn đề: A frozen + GPM → SVD signatures structurally constrained
+Because of GPM, $A_k \perp A_j$ (approximately, via projection), so:
+$$\text{row}(A_k) \perp \text{row}(A_j) \quad \forall j < k$$
+**Theorem (informal)**: If $A_k \perp A_j$ exactly, then $V_k \perp V_j$ exactly (the right singular vectors are orthogonal).
+**Proof**: The columns of $V_k$ lie in $\text{row}(A_k)$ and the columns of $V_j$ lie in $\text{row}(A_j)$; since $\text{row}(A_k) \perp \text{row}(A_j)$, we get $V_k^\top V_j = 0$.
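Both claims (right singular vectors of $BA$ stay inside $\text{row}(A)$, and orthogonal row-spaces give orthogonal signatures) are easy to check numerically. The sketch below uses NumPy with random matrices as stand-ins for trained LoRA factors; the helper names and the dimensions are illustrative assumptions, not taken from the SpecRoute code.

```python
import numpy as np

def right_singular_vectors(B, A):
    """Top-r right singular vectors (d x r) and singular values of B @ A."""
    r = A.shape[0]
    _, s, vt = np.linalg.svd(B @ A, full_matrices=False)
    return vt[:r].T, s[:r]

def rowspace_residual(V, A):
    """Frobenius norm of the part of V that lies outside row(A)."""
    Q, _ = np.linalg.qr(A.T)          # orthonormal basis of row(A), shape d x r
    return np.linalg.norm(V - Q @ (Q.T @ V))

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 64, 8
A = rng.standard_normal((r, d_in))    # frozen factor (random init)
B = rng.standard_normal((d_out, r))   # trainable factor (arbitrary values)
V, sigma = right_singular_vectors(B, A)
residual = rowspace_residual(V, A)    # ~0: V lives in row(A)

# Second task: project its A onto the orthogonal complement of row(A),
# mimicking an exact GPM projection.
Q, _ = np.linalg.qr(A.T)
A2 = rng.standard_normal((r, d_in))
A2 = A2 - (A2 @ Q) @ Q.T
B2 = rng.standard_normal((d_out, r))
V2, _ = right_singular_vectors(B2, A2)
cross = np.abs(V.T @ V2).max()        # ~0: signatures are orthogonal
```

Whatever values `B` and `B2` take, `residual` and `cross` stay at floating-point noise, which is the structural constraint the theorem describes.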
+**Consequence for routing**: spectral routing uses:
+$$\text{fit}_t(h) = \frac{\sum_i \sigma_{t,i}^2 (v_{t,i}^\top h)^2}{\sum_i \sigma_{t,i}^2 \|h\|^2}$$
+When $V_k \perp V_j$, routing depends entirely on the **energy projection**:
+- How much of $h$'s energy lies in $\text{row}(A_k)$ versus $\text{row}(A_j)$?
+**MAIN PROBLEM**: If two tasks share a domain (e.g., yelp/amazon, both 5-class sentiment), their inputs $h$ are SIMILAR. But because of GPM, $A_{\text{yelp}}$ and $A_{\text{amazon}}$ are orthogonal → the energy projection depends on the **RANDOM** A directions, NOT on task semantics.
+This is exactly the **GPM-Routing Paradox** identified earlier, but now we see the problem runs deeper: the **SVD signatures are determined by A** when A is frozen.
+### 2.4 Is frozen A + trainable B compatible with the SVD?
+**Short answer**: Yes, but with important caveats.
+**Compatible**: SVD($BA$) CORRECTLY reflects the effective modification direction. Training B changes the singular values and the left singular vectors, while the right singular vectors stay constrained to $\text{row}(A)$. The SVD remains mathematically correct.
+**Not well suited to ROUTING**: The right singular vectors ($V$), which are what we use for routing, are tied to A's row-space. With A frozen and orthogonalized, the routing information is **determined by A** and does not reflect **B's learned task-specifics**.
+**Formal statement**: Routing quality bounded by:
+$$\max_{h} |\text{fit}_k(h) - \text{fit}_j(h)| \leq 1 - \cos^2(\text{row}(A_k), \text{row}(A_j))$$
+When $A_k \perp A_j$ (GPM), $\cos^2 = 0$ → the maximum possible discrimination is 1. **However**, the discrimination actually realized depends on how $h$ aligns with these subspaces, and $h$ is NOT controlled by the routing mechanism.
+**Paper references**:
+- **Function Vectors** (Todd et al., ICLR 2025): Models carry compact task-encoding "function vectors"; h encodes task information but is NOT guaranteed to align with random A directions
+- **GIFT** (CVPR 2025): Fisher Information as Riemannian metric — shows parameter sensitivity IS task-dependent
+- **Low-rank Forgetting Analysis** (NeurIPS 2025): Forgetting matrix $F = \Delta W_{\text{old}}^\top \Delta W_{\text{new}}$ — has low-rank structure
+### 2.5 Conclusions on SVD signatures
+1. **The SVD is mathematically correct** as a decomposition of $\Delta W = BA$
+2. **Right singular vectors** ($V$) are **tied to row(A)** when A is frozen → they mainly reflect A's structure
+3. **Singular values** ($\sigma$) DO reflect B's learning → key discriminative signal
+4. **Routing via (V, σ)**: depends on both A (directions) AND B (magnitudes). When C4 improves B's training quality → richer σ spectrum → discrimination increases
+5. **Open question (V6 will answer)**: Is C4 strong enough to compensate for V being tied to row(A)? That is, is σ²-weighting discriminative enough when the V are orthogonal?
+---
+## II-bis. PROTOTYPE ROUTING VIOLATES ZERO-REPLAY
+### Analysis of the logical error in v1
+Version v1 recommended prototype routing (V5) based on the Data Processing Inequality:
+$$I(T; f(\Delta W, h)) \leq I(T; A_t h) \leq I(T; h)$$
+Conclusion: "prototypes use h directly → bypass the A bottleneck → optimal".
+**However**, the prototype $\mu_t = \frac{1}{N_t}\sum_{i} h_i^{(t)}$ is a **data statistic** (the mean of training embeddings). According to settings.txt:
+> "reusing old data in any form is not allowed, including raw data, synthetic data, and **data distributions (produced by statistical tools)**"
+Mean embedding = first moment of the distribution = a statistic → **violation**.
+### The valid boundary: Model Parameters vs Data Statistics
+| Information | Source | Valid? | Reason |
+|------------|--------|---------|-------|
+| $A_t, B_t$ (frozen LoRA) | Training artifact | ✅ | Model parameters |
+| SVD$(B_t A_t)$ | Derived from params | ✅ | Pure computation on params |
+| GPM bases ($U$ from covariance SVD) | Forward on data → covariance | ⚠️ **Borderline** | ROOT also uses them; accepted by the community |
+| Mean embedding $\mu_t$ | Forward on data → statistic | ❌ | Data distribution statistic |
+| Distribution params (variance, etc.) | Forward on data → statistic | ❌ | Explicit distribution |
+**Important observation**: GPM bases are also computed from training data (forward 1000 batches → covariance → SVD → bases). The key differences:
+- GPM bases are used for **protection** (constraining future A), after which the **feature_list is deleted** (line 304): a standard CL technique
+- Prototypes are used for **inference routing** on TEST data, stored and used permanently → OLD DATA DIRECTLY AFFECTS PREDICTIONS on new data
+### Consequence: spectral routing = the ONLY zero-replay-compliant parameter-free option
+At inference, only $h$ (test input) and $\{A_t, B_t\}_{t=1}^T$ (model params) are available. Routing MUST take the form:
+$$\text{routing}(h) = f(h; \{A_t, B_t\}_{t=1}^T)$$
+Spectral routing (Rayleigh quotient) satisfies this condition. Learned routing (ROOT's MLP) does too. Prototype routing does **NOT**.
+---
+## III. IS THE CURRENT ROUTING MECHANISM OPTIMAL?
+### 3.1 Current routing: σ²-weighted Rayleigh quotient
+$$w_t(h) = \text{softmax}\left(\frac{\mathbf{h}^\top V_t \text{diag}(\sigma_t^2) V_t^\top \mathbf{h}}{\|\sigma_t\|^2 \|\mathbf{h}\|^2}\right)$$
+This is the **Rayleigh quotient** of the PSD matrix $V_t \text{diag}(\sigma_t^2) V_t^\top$ with respect to $h$, normalized by $\|\sigma_t\|^2$, with the softmax taken across tasks $t$.
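As an illustration of the formula above, the following NumPy sketch computes the per-expert fit and the softmax routing weights. The function names, the `temperature` argument, and the toy signatures are assumptions for illustration only; this is not the SpecRoute implementation.

```python
import numpy as np

def spectral_fit(h, V, sigma):
    """sigma^2-weighted Rayleigh quotient of h against one signature (V: d x r)."""
    proj = V.T @ h                     # coordinates of h along each v_i
    weights = sigma ** 2
    return float((weights * proj ** 2).sum() / (weights.sum() * (h @ h)))

def routing_weights(h, signatures, temperature=1.0):
    """Softmax over per-expert fits; `temperature` is an assumed knob."""
    fits = np.array([spectral_fit(h, V, s) for V, s in signatures])
    logits = fits / temperature
    logits = logits - logits.max()     # numerical stability
    w = np.exp(logits)
    return fits, w / w.sum()

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((16, 8)))
V1, V2 = Q[:, :4], Q[:, 4:]            # two mutually orthogonal signatures
sigma = np.array([2.0, 1.0, 0.5, 0.25])
h = V1 @ np.array([1.0, 2.0, 3.0, 4.0])  # input lying entirely in span(V1)
fits, weights = routing_weights(h, [(V1, sigma), (V2, sigma)])
```

Because every fit lies in $[0, 1]$, a softmax at temperature 1 produces fairly flat weights even when one expert matches perfectly, which is one reason a temperature or bias term matters in practice.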
+### 3.2 Phân tích Information-Theoretic
+From the viewpoint of **sufficient statistics** (Fisher-Neyman):
+**Routing problem**: Given an observation $h$, decide $t^* = \arg\max_t P(t|h)$.
+**Bayes optimal**: $w_t(h) = P(t|h) \propto P(h|t) P(t)$.
+**What does spectral routing assume?** It assumes that $P(h|t)$ corresponds to the energy projection onto $V_t$. That is, an $h$ from task $t$ concentrates its energy in $\text{span}(V_t)$.
+**When does the assumption hold?**
+- When the A directions align with the task-discriminative input subspace
+- When different tasks' inputs separate in A's row space
+**When does the assumption FAIL?**
+- Same-domain tasks → similar $h$ → projection onto orthogonal $A_k, A_j$ gives RANDOM results
+- Task inputs do not align with the random A → low fit for ALL tasks → routing degeneracy
+### 3.3 Alternatives: which mathematical tools might do better?
+#### Alternative 1: Nuclear Norm / Trace Norm
+Instead of SVD spectral routing, use the **trace of a product**:
+$$\text{aff}_t(h) = \|\Delta W_t h\|_2^2 = h^\top (BA)^\top (BA) h = h^\top A^\top B^\top B A h$$
+This is exactly the current computation (up to the normalization by $\|\sigma_t\|^2 \|h\|^2$), but **without needing the SVD**: only $B^\top B$ (the Gram matrix) is required.
+**Insight**: The SVD is just a decomposition of $A^\top B^\top B A$. If A is orthogonalized, $A^\top A \approx c \cdot I$ on the row-space → the SVD reduces to that of $B^\top B$.
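The equivalence between the SVD-based fit and the Gram-matrix form can be verified directly. In this sketch (NumPy, random matrices as stand-ins, hypothetical function names) the normalizer uses the identity $\|BA\|_F^2 = \text{tr}(B^\top B \, A A^\top) = \sum_i \sigma_i^2$.

```python
import numpy as np

def fit_via_svd(h, B, A):
    """sigma^2-weighted Rayleigh quotient computed from the SVD of B @ A."""
    r = A.shape[0]
    _, s, vt = np.linalg.svd(B @ A, full_matrices=False)
    s, vt = s[:r], vt[:r]
    return float((s ** 2 * (vt @ h) ** 2).sum() / ((s ** 2).sum() * (h @ h)))

def fit_via_gram(h, B, A):
    """Same quantity without any SVD, using only r x r matrices."""
    G = B.T @ B                                # r x r Gram matrix of B
    z = A @ h                                  # r-dim projection of the input
    num = z @ G @ z                            # equals ||B A h||^2
    den = np.trace(G @ (A @ A.T)) * (h @ h)    # equals ||B A||_F^2 * ||h||^2
    return float(num / den)

rng = np.random.default_rng(1)
d_out, d_in, r = 32, 64, 8
A = rng.standard_normal((r, d_in))
B = rng.standard_normal((d_out, r))
h = rng.standard_normal(d_in)
svd_value = fit_via_svd(h, B, A)
gram_value = fit_via_gram(h, B, A)
```

The Gram form costs $O(rd + r^2)$ per input once $B^\top B$ and $\text{tr}(B^\top B\,AA^\top)$ are cached, so it is a cheap drop-in for the SVD route when only the scalar fit is needed.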
+**Paper references**:
+- **Marchenko-Pastur Law** (1967): Random matrix $A$ với entries i.i.d. → $A^\top A / n \to I$ as $d \to \infty$
+- **Random Matrix Theory** (Anderson, 2003): Concentration of spectral norms for random projections
+#### Alternative 2: Frobenius Inner Product (Direct Affinity)
+$$\text{aff}_t(h) = \langle \Delta W_t^\top \Delta W_t,\; h h^\top \rangle_F = \text{tr}(\Delta W_t^\top \Delta W_t\, h h^\top) = \|\Delta W_t h\|_2^2$$
+Measures how much $\Delta W_t$ modifies input $h$ → more direct measure of "expert relevance".
+#### Alternative 3: Grassmannian Distance
+Instead of projecting $h$ onto subspaces, measure the **geodesic distance** between the subspace $\text{col}(V_t)$ and the direction $h/\|h\|$:
+$$d_G(h, V_t) = \arccos\left(\frac{\|V_t^\top h\|}{\|h\|}\right)$$
+**Paper references**:
+- **Absil et al.** (2004, 2008): Optimization on Stiefel and Grassmann manifolds — geodesic distances, exponential maps
+- **Hamm & Lee** (2008): Grassmann discriminant analysis — subspace classification
+- **Edelman, Arias, Smith** (1998): Geometry of algorithms with orthogonality constraints
+#### Alternative 4: CKA (Centered Kernel Alignment)
+$$\text{CKA}(K_h, K_t) = \frac{\|K_h K_t\|_F}{\|K_h\|_F \|K_t\|_F}$$
+Với $K_t = V_t V_t^\top$ (projection kernel) và $K_h = h h^\top$.
+**Paper references**:
+- **Kornblith et al.** (2019): CKA for comparing neural network representations
+- **Nguyen et al.** (2021): CKA variants for continual learning similarity
+#### Alternative 5: Projection Residual (Information Loss)
+Instead of "how much energy projects", measure "how much information is LOST after projection":
+$$\text{loss}_t(h) = \|(I - V_t V_t^\top) h\|^2 / \|h\|^2 = 1 - \|V_t^\top h\|^2 / \|h\|^2$$
+Routing then minimizes information loss. This coincides with $1 - \text{fit}_t(h)$ only when the singular values are equal (unweighted projection); with σ²-weighting the two differ, but the residual view is conceptually clearer.
+### 3.4 Key observation: the DPI holds but does NOT imply that spectral routing fails
+**Theorem (Fundamental Routing Information Bound under InfLoRA)**:
+Let $\Delta W_t = B_t A_t$ with $A_t$ frozen. ANY routing function $f(h, \Delta W_t)$ based on $\Delta W_t$ or derived quantities (SVD, norms, projections) satisfies:
+$$f(h, B_t A_t) \text{ distinguishes } h \text{ only through } A_t h$$
+because $B_t A_t h = B_t (A_t h)$: the output depends on $h$ only through the projection $A_t h$.
+**Data Processing Inequality**: $I(T; f) \leq I(T; A_t h) \leq I(T; h)$
+**BUT, correcting the v1 error**: the DPI says $I(T; A_t h) \leq I(T; h)$; it does NOT say $I(T; A_t h) = 0$.
+In practice, even a random $A_t$ (with orthonormal rows) still captures an $r/d$ fraction of the energy:
+$$\mathbb{E}[\|A_t h\|^2 / \|h\|^2] = r/d \approx 8/512 = 1.56\%$$
+More importantly: **singular values $\sigma_t$ encode B's task-specific learning** (Section 2.4). The Rayleigh quotient:
+$$\text{fit}_t(h) = \frac{\sum_i \sigma_{t,i}^2 (v_{t,i}^\top h)^2}{\sum_i \sigma_{t,i}^2 \|h\|^2}$$
+With C4 enriching the $\sigma$ spectrum (high effective rank), discrimination INCREASES because:
+1. Stronger expert response (larger σ) → higher fit_t for matching inputs (note that fit_t is normalized by $\sum_i \sigma_{t,i}^2$, so this depends on the relative shape of the spectrum rather than its overall scale)
+2. Full-rank LoRA → the expert captures more directions → richer projection of h
+**Revised conclusion**: Spectral routing is subject to an information bottleneck, but the bottleneck can be **wide enough** if expert quality is high. V6 tests exactly this hypothesis.
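The $r/d$ energy estimate used in this section can be checked with a small Monte Carlo experiment. The sketch assumes an $A$ with orthonormal rows and isotropic Gaussian inputs; real hidden states are not isotropic, so the number is a baseline rather than a prediction for actual T5 activations.

```python
import numpy as np

def captured_energy_fraction(d=512, r=8, n_samples=4000, seed=0):
    """Average of ||A h||^2 / ||h||^2 for a random r-dim subspace of R^d."""
    rng = np.random.default_rng(seed)
    # Random subspace: orthonormal rows obtained from the QR of a Gaussian matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((d, r)))   # d x r, orthonormal columns
    A = Q.T                                            # r x d, orthonormal rows
    H = rng.standard_normal((n_samples, d))            # isotropic stand-in inputs
    ratios = ((H @ A.T) ** 2).sum(axis=1) / (H ** 2).sum(axis=1)
    return float(ratios.mean())

mean_fraction = captured_energy_fraction()
expected = 8 / 512   # r / d = 0.015625
```

The empirical mean lands close to `expected`; any task-specific signal in routing therefore has to come from inputs that deviate from isotropy or from the σ² weighting.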
+**Paper references**:
+- **Data Processing Inequality** (Cover & Thomas, 2006): $I(X;Z) \leq I(X;Y)$ when $X \to Y \to Z$
+- **Sufficient Statistics** (Fisher-Neyman Factorization Theorem): routing is optimal when it uses a sufficient statistic for task identity
+- **Information Bottleneck** (Tishby et al., 2000): Tradeoff compression vs. prediction
+- **Johnson-Lindenstrauss Lemma** (1984): a random projection preserves pairwise distances among $T$ points within $(1 \pm \epsilon)$ using $O(\log T / \epsilon^2)$ dimensions. For ~15 task centroids this suggests that a low-dimensional projection can retain separation, although with only 8 dimensions the lemma's constants do not give a guarantee, so treat it as a heuristic
+### 3.5 V6 Hypothesis: C4 Makes Spectral Routing Viable
+**Hypothesis (v6_discuss.md)**: V1-V3 failed NOT only because of the routing mechanism, but because of **POOR expert quality** (no C4):
+- No preconditioning → gradient distortion → B learns poorly → degenerate σ
+- No entropy regularization → rank-1 LoRA → one direction dominates → routing degeneracy
+**C4 fixes expert quality → σ spectrum phong phú → spectral routing trở nên discriminative.**
+**Logic chain**:
+$$\text{C4 (preconditioning + entropy)} \to \text{Better B training} \to \text{Richer σ spectrum}$$
+$$\to \text{Higher effective rank} \to \text{More discriminative fit}_t(h)$$
+$$\to \text{Better routing} \to \text{Performance improvement}$$
+**Verification**: V5 có C4 + prototype. V6 = C4 only → isolate C4's contribution.
+**V6 predictions**:
+- If C4 is the key: AP(EM) ~45-55 (far above V3=27.66, but below V5=59.55 because the prototype advantage is lost)
+- If C4 is not enough: AP(EM) ~30-35 (similar to V3)
+- If C4 destabilizes training: NaN or loss spikes → reduce λ or disable preconditioning
+---
+## IV. SYNTHESIS AND PROPOSALS (REVISED v2)
+### 4.1 Summary of Theoretical Findings
+| Finding | Implication |
+|---------|------------|
+| Never-learning stems from A frozen + GPM, NOT from routing | C2 (routing) is on the right track; C4 (expert quality) needs to be stronger |
+| SVD signatures are mathematically CORRECT, $\text{col}(V) \subseteq \text{row}(A)$ | Routing DIRECTIONS are constrained, but σ² MAGNITUDES reflect learning |
+| GPM ⊥ → perfect subspace separation | Routing relies on the σ²-weighted projection, NOT only on V directions |
+| DPI: $I(T; Ah) \leq I(T; h)$ | A bottleneck exists but is NOT zero; it can be wide enough if expert quality is high |
+| **Prototype routing VIOLATES zero-replay** | **V5 INVALID per settings.txt** |
+| C4 improves expert quality → richer σ → better routing | **V6 hypothesis: C4 = key to making spectral routing work** |
+### 4.2 Prototype Routing: why it was wrong and lessons learned
+**The V5 mistake**: Prototype $\mu_t = \text{mean}(h_i^{(t)})$ = **first moment of data distribution** → violates zero-replay.
+**Lesson**: Information-theoretic optimality ($I(T; h)$ vs $I(T; Ah)$) MUST respect the constraints. An "optimal" solution that violates the settings is worthless.
+**GPM bases also use data**, but only for protection (standard CL practice; ROOT does the same). The boundary: data MAY be used to constrain future learning (GPM), but MAY NOT be used to influence future predictions (prototypes).
+### 4.3 SVD Routing + C4: the V6 hypothesis
+**V6 = spectral routing (V3 mechanism) + C4 (preconditioning + entropy)**
+Why did V3 fail (AP=27.66)?
+1. V3 ran the WRONG SCRIPT (threshold=0.98 instead of 0.995): **a bug, not a limitation**
+2. V3 had NO C4 → LoRA trains poorly → degenerate σ → routing is effectively random
+3. Train-inference mismatch: adaptive bias at train time, SVD at inference → B is optimized under the wrong routing
+V6 fixes:
+1. ✅ Threshold = 0.995 (correct)
+2. ✅ C4 enabled (preconditioning + entropy)
+3. ✅ Adaptive bias at train time, symmetric SVD at inference (mechanism unchanged but expert quality vastly better)
+**Prediction (theory-based)**:
+C4 addresses Cause 1 (gradient distortion) and Cause 3 (CE-only loss) from Section I:
+- Preconditioner $(AA^\top + \epsilon I)^{-1/2}$ equalizes gradient → condition-number-independent learning
+- Entropy regularization → maximizes effective rank → expert responds to more directions
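The two C4 ingredients named above can be sketched in NumPy. The notes do not spell out how the preconditioner is applied to gradients or how the entropy term is defined, so both choices here are assumptions: the preconditioner is computed by eigendecomposition, and effective rank is taken as the exponential of the Shannon entropy of the normalized $\sigma^2$ spectrum. Function names are hypothetical.

```python
import numpy as np

def inv_sqrt_preconditioner(A, eps=1e-6):
    """(A A^T + eps I)^(-1/2) for an r x d matrix A, via eigendecomposition."""
    M = A @ A.T + eps * np.eye(A.shape[0])     # symmetric positive definite, r x r
    evals, evecs = np.linalg.eigh(M)
    return (evecs * evals ** -0.5) @ evecs.T

def effective_rank(sigma, tiny=1e-12):
    """exp(entropy) of the normalized sigma^2 spectrum (assumed definition)."""
    p = sigma ** 2 / (sigma ** 2).sum()
    entropy = -(p * np.log(p + tiny)).sum()
    return float(np.exp(entropy))

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 64))
P = inv_sqrt_preconditioner(A)
whitened = P @ A                               # rows become (almost) orthonormal
flat_rank = effective_rank(np.ones(8))                         # close to 8
peaked_rank = effective_rank(np.array([1.0, 0.0, 0.0, 0.0]))   # close to 1
```

A flat spectrum reaches the maximum effective rank $r$, while a rank-1 spectrum collapses to 1, which is the degeneracy the entropy term is meant to discourage.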
+If V6 AP ~45-55 → C4 is a significant contributor → **spectral routing viable under zero-replay**
+If V6 AP ~30 → expert quality is NOT sufficient → the GPM-Routing Paradox remains dominant
+### 4.4 What needs revising in the contribution ideas?
+#### C1 (Spectral Signatures): KEEP, ROLE IS CORRECT
+- SVD signatures serve both routing AND characterization
+- Thin SVD optimization (QR+SVD) remains valid
+- σ values are the key discriminative signal (combined with V directions)
+#### C2 (Routing): NEEDS REVISION
+- ~~Prototype routing~~ → **Remove**, violates zero-replay
+- **Keep**: SVD spectral routing (Rayleigh quotient), the ONLY valid option
+- **Keep**: Adaptive bias $\beta(n)$ for training cold-start
+- **Keep**: Symmetric SVD inference routing
+- **New**: C4 is the deciding factor for routing *quality* (not a new mechanism but the *quality of what's being routed TO*)
+#### C3 (ESA): KEEP
+- Dynamic threshold works
+#### C4 (Preconditioning + Entropy): **BECOMES PIVOTAL**
+- Does not only improve single-task quality
+- **Directly improves routing quality** through a richer σ spectrum
+- C4 = bridge between protection (GPM) and routing (spectral affinity)
+- Preconditioning: gradient equalization → all rank directions learn → non-degenerate σ
+- Entropy: explicit rank maximization → more responsive expert → better fit discrimination
+### 4.5 Concrete proposed changes
+#### 1. DEPLOY V6 (branch `new`)
+V6 = spectral routing + C4 = **ONLY valid configuration** under zero-replay.
+Prediction range: AP(EM) 45-55 (optimistic: C4 strong) or 30-35 (pessimistic: C4 insufficient)
+#### 2. Update SPECROUTE_IDEA.md
+- Remove C2.1 (prototype routing section)
+- Re-frame C4 as an enabling technology for C2 (not an independent contribution)
+- Routing–Protection Duality theorem STILL VALID (spectral affinity ↔ protection quality)
+- New framing: "C4 completes the duality: protection provides direction discrimination, C4 provides magnitude discrimination"
+#### 3. If V6 AP ~30 (C4 insufficient) → Next direction
+- Relaxed orthogonality: $A_t = (1-\eta) A_t^{\perp} + \eta A_t^{\parallel}$, allowing a small overlap
+  - Tradeoff: slight forgetting ↑ but routing discrimination ↑↑
+  - $\eta \in [0.05, 0.2]$
+  - ZERO-REPLAY COMPLIANT (only modifies A init, no data stored)
+- Adaptive epochs for tiny datasets (CB: 250 samples → more steps)
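The relaxed-orthogonality initialization proposed above can be sketched as follows. This assumes $A_t^{\perp}$ and $A_t^{\parallel}$ are the components of a raw random init outside and inside the subspace protected by GPM; the notes do not define them explicitly, and the function names and the orthonormal stand-in for the GPM bases are hypothetical.

```python
import numpy as np

def relaxed_init(A0, U_protected, eta):
    """Blend the null-space and protected-space parts of a raw init A0 (r x d)."""
    A_par = (A0 @ U_protected) @ U_protected.T   # component inside the protected subspace
    A_perp = A0 - A_par                          # component in the GPM null-space
    return (1.0 - eta) * A_perp + eta * A_par

def overlap_fraction(A, U_protected):
    """Share of A's squared Frobenius norm that lies in the protected subspace."""
    return float(np.linalg.norm(A @ U_protected) ** 2 / np.linalg.norm(A) ** 2)

rng = np.random.default_rng(0)
d, k, r = 64, 24, 8
U, _ = np.linalg.qr(rng.standard_normal((d, k)))   # stand-in for GPM bases (d x k)
A0 = rng.standard_normal((r, d))
strict = relaxed_init(A0, U, eta=0.0)              # plain InfLoRA-style projection
relaxed = relaxed_init(A0, U, eta=0.1)             # small, controlled overlap
strict_overlap = overlap_fraction(strict, U)
relaxed_overlap = overlap_fraction(relaxed, U)
```

`overlap_fraction` gives a direct handle on the forgetting side of the tradeoff: it measures how much of the new adapter's input sensitivity falls in directions already claimed by earlier tasks.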
+---
+## V. CONCLUSION (REVISED)
+### Answers to the original questions
+1. **Root cause of never-learning**: A frozen + GPM projection → restricted capacity + random A ≠ optimal A + insufficient training. NOT routing. ROOT also near-fails on CB (3.57). This is a fundamental tradeoff of InfLoRA.
+2. **Does the SVD need reconsidering?**:
+   - The SVD is mathematically correct
+   - $\text{col}(V) \subseteq \text{row}(A)$ → directions constrained, but **σ values reflect B's learning** → key discriminative signal
+   - Freezing A IS a good choice for anti-forgetting
+   - SVD for routing: the σ²-weighted Rayleigh quotient is CORRECT, but needs C4 to boost expert quality
+   - **NO need to replace the SVD with another tool**: the problem is expert quality, not the measurement method
+3. **Is the routing mechanism optimal?**:
+   - Spectral routing = ONLY zero-replay-compliant parameter-free option
+   - ~~Prototype routing~~ **violates zero-replay** → discarded
+   - V1-V3 may have failed because of EXPERT QUALITY (no C4), not only the routing mechanism
+   - V6 (SVD + C4) = the experiment that tests the C4 hypothesis
+   - The DPI bottleneck exists; Johnson-Lindenstrauss suggests, heuristically rather than as a guarantee at this dimension, that an 8-dim random projection may preserve distances for ~15 tasks
+### Recommendation
+1. **DEPLOY V6** (branch `new`): SVD routing + C4 complies with the constraint
+2. **V5 prototype routing INVALID**: violates zero-replay
+3. **C4 becomes the pivotal contribution**: not just a C1 add-on but what ENABLES C2 routing quality
+4. **Wait for V6 results** before deciding next steps:
+   - AP ≥ 45 → spectral + C4 viable, continue optimizing
+   - AP ≤ 35 → need relaxed orthogonality or a fundamental rethink
+---
+## Paper References (30+)
+### LoRA & Low-Rank Optimization
+1. Hu et al. (2022) — LoRA: Low-Rank Adaptation of Large Language Models
+2. Sun et al. (CVPR 2024) — InfLoRA: Interference-Free Low-Rank Adaptation
+3. Jordan et al. (2025) — Muon/Riemannion: Riemannian LoRA
+4. LORO (ICLR 2025) — Low-rank Riemannian Optimizer
+5. SD-LoRA (ICLR 2025) — Singular value Direction decomposition
+6. Stiefel-LoRA (EMNLP 2025) — Orthogonal B constraints
+7. LoKO (2024) — Kalman Filter LoRA Optimizer
+8. LoRA– (CVPR 2025) — Triplet loss in Drift-Resistant Space
+### Continual Learning
+9. GORP (ACL 2025) — Unified Gradient Subspace Projection
+10. CMCL (NeurIPS 2025) — Dual Stability/Plasticity Bounds
+11. CaLoRA (NeurIPS 2025) — Causal Gradient Adaptation
+12. PLAN (ICCV 2025) — Proactive Rank Allocation
+13. TreeLoRA (ICML 2025) — Gradient Similarity Tree
+14. Angle Matters (ICML 2025) — Angular task signal analysis
+15. SEFE (ICML 2025) — Superficial vs Essential Forgetting
+16. GIFT (CVPR 2025) — Fisher Information LoRA
+17. Low-rank Forgetting Analysis (NeurIPS 2025)
+### SVD & Matrix Theory
+18. Eckart-Young (1936) — Best low-rank approximation
+19. Weyl's Perturbation Theorem — Singular value stability
+20. Horn & Johnson (2013) — Matrix Analysis (product SVD)
+21. Thompson (1976) — Interlacing inequalities for products
+22. Marchenko-Pastur (1967) — Random matrix spectral distribution
+23. Anderson (2003) — Random Matrix Theory
+### Grassmannian & Manifold Geometry
+24. Absil et al. (2004, 2008) — Optimization on Grassmann manifolds
+25. Hamm & Lee (2008) — Grassmann Discriminant Analysis
+26. Edelman, Arias, Smith (1998) — Geometry with orthogonality
+### Information Theory
+27. Cover & Thomas (2006) — Data Processing Inequality
+28. Tishby et al. (2000) — Information Bottleneck Method
+29. Fisher-Neyman Factorization Theorem — Sufficient statistics
+30. Kornblith et al. (2019) — CKA representation similarity
+31. Todd et al. (ICLR 2025) — Function Vectors
+32. Nguyen et al. (2021) — CKA for continual learning
+33. **Johnson & Lindenstrauss (1984)** — Random projection preserves distances
+### GainLoRA Specific
+34. GainLoRA (original paper) — Gating + InfLoRA architecture
+35. GPM (Saha et al., 2021) — Gradient Projection Memory
+### Appendix: V1 Analysis Error Log
+**V1 error**: Recommended prototype routing (V5), but prototype = mean embedding = data statistic → violates zero-replay.
+**Root cause**: The DPI argument ($I(T;h) \geq I(T;Ah)$) is mathematically correct but ignored the constraint. It optimized information access while violating the allowable information set.
+**Lesson**: In research, correctness = mathematical validity + constraint satisfaction. A solution that is information-theoretically optimal but violates the settings is invalid. This is a good example for research_rule: always verify a solution against ALL constraints before recommending it.

results/experiment_versions.md CHANGED Viewed

@@ -1,542 +1,192 @@
-# SpecRoute: Experiment Report by Version
-> Tracks all experiment versions, results, analyses, and improvements.
-> Benchmark: Long Sequence Order 3, 15 classification tasks, model T5-Small.
 ---
-## Version 1.0: Baseline SpecRoute (first results)
-### Experimental setup
-- **Model**: T5-Small (d_model=512, 6 encoder + 6 decoder layers)
-- **Method**: SpecRoute, where spectral routing (SVD of LoRA B@A) replaces GainLoRA's learned routing (trans_input + prompt_key)
-- **Compared against**: ROOT GainLoRA-InfLoRA (original codebase)
-- **Hyperparameters**: lora_r=8, lora_alpha=32, lr=3e-4, 10 epochs, threshold=0.995
-- **Platform**: Kaggle T4 GPU
-### Results
-| # | Task | ROOT (Final R_{15,j}) | SpecRoute (Peak R_{j,j}) | Δ |
-|---|------|-----------------------|--------------------------|---|
-| 1 | yelp | 56.01 | 54.36 | -1.65 |
-| 2 | amazon | 52.05 | 50.01 | -2.04 |
-| 3 | mnli | 34.07 | 35.50 | +1.43 |
-| 4 | cb | 3.57 | 0.00 | -3.57 |
-| 5 | copa | 42.00 | 44.00 | +2.00 |
-| 6 | qqp | 76.96 | 76.72 | -0.24 |
-| 7 | rte | 45.85 | 50.90 | +5.05 |
-| 8 | imdb | 89.51 | **0.21** ⚠️ | -89.30 |
-| 9 | sst2 | 85.21 | **0.00** ⚠️ | -85.21 |
-| 10 | dbpedia | 98.16 | 92.22 | -5.94 |
-| 11 | agnews | 88.37 | 68.76 | -19.61 |
-| 12 | yahoo | 57.28 | **8.12** ⚠️ | -49.16 |
-| 13 | multirc | 50.52 | 54.23 | +3.71 |
-| 14 | boolq | 60.43 | 61.13 | +0.70 |
-| 15 | wic | 55.49 | **0.00** ⚠️ | -55.49 |
-| | **Mean** | **59.70** | **39.74** | **-19.96** |
-> ⚠️ **IMPORTANT NOTE**: The comparison is NOT fair: ROOT uses R_{15,j} (final, after all 15 tasks) while SpecRoute uses R_{j,j} (peak, right after training each task). SpecRoute's true AP will be lower than 39.74.
-### Analysis
-**1. Prediction metrics were not saved**
-- SpecRoute's `all_results.json` contains only training metrics, with NO `predict_exact_match_for_{task}`
-- `task_order.txt` does not exist → `score.py` cannot compute AP/FT
-- Cause: the experiment may have been run with a different script (not the T5_small/ scripts that include the `--do_predict` fix)
-- The T5-large script generator (`generate_specroute_scripts_v2.py`) still has the `do_predict=False` bug for long benchmarks
-**2. The FAILED tasks are NOT due to catastrophic forgetting**
-| Task | Train Loss (Root) | Train Loss (SpecRoute) | Ratio | Verdict |
-|------|:-:|:-:|:-:|---|
-| imdb | 1.41 | **4.15** | 2.9x | Cannot learn |
-| sst2 | 1.76 | **4.45** | 2.5x | Cannot learn |
-| yahoo | 1.19 | **3.08** | 2.6x | Cannot learn |
-| wic | 0.96 | **3.65** | 3.8x | Cannot learn |
-Training loss is 2.5-3.8x higher → the model CANNOT LEARN from the start (inability to learn, NOT catastrophic forgetting).
-**3. Root cause: GPM null-space saturation + missing protection mechanisms**
-SpecRoute removes learned routing → and with it loses three of the four protection mechanisms that ROOT actually uses:
-| Protection Mechanism | ROOT | SpecRoute V1 |
-|---------------------|:---:|:---:|
-| GPM on LoRA A | ✅ | ✅ |
-| KL distillation on routing | ✅ | ❌ |
-| Data replay | ❌ (`data_replay_freq=-1`) | ❌ |
-| Per-step GPM on routing params | ✅ | ❌ (no routing params) |
-| Learned routing adaptation | ✅ | ❌ (by design) |
-When similar tasks arrive (imdb/sst2 vs yelp/amazon, same sentiment domain), GPM has already "claimed" the sentiment-relevant directions → the model is forced into an unrelated orthogonal null-space → it CANNOT learn new sentiment tasks.
-ROOT GainLoRA handles this because the trans_input MLP maps new inputs into a representation space that REUSES old knowledge, combined with KL distillation (per the table above, ROOT does not use data replay: `data_replay_freq=-1`).
-**4. FT (Forgetting) = N/A**
-- Cannot be computed because cross-task prediction metrics are missing
-### Improvements for V2
-| # | Type | Content | Impact |
-|---|------|---------|----------|
-| 1 | Bug fix | Fix `do_predict=False` → `True` in the generator | Allows AP/FT to be computed correctly |
-| 2 | Config | Lower GPM threshold: 0.995 → 0.980 | Widens the null-space for later tasks |
-| 3 | **Idea change** | Add Experience Replay (CE loss on old task data) | Counters forgetting + supports knowledge reuse |
 ---
-## Version 2.0: SpecRoute V2 with Zero-Replay, Cold-Start Fix + Fair Comparison
-### Changes to the idea
-> **⚠️ THE EARLIER V2.0 WAS CANCELLED**: The earlier V2 added Experience Replay (CE loss on old data).
-> This **VIOLATES** the zero-replay constraint in settings.txt:
-> *"reusing old data in any form is not allowed"*
->
-> Moreover, ROOT GainLoRA also does **NOT** use replay (`data_replay_freq=-1` in ALL scripts).
-> ROOT reaches AP=59.70 purely through: learned routing (trans_input + prompt_key) + GPM on LoRA_A + GPM on routing params.
->
-> **V2 Correct**: Fix root causes of V1 failure within zero-replay constraint.
-### Root Cause Analysis (V1 Failures)
-**Bug 1: Cold-start, code does not match the IDEA doc (Sec 2.2)**
-- The IDEA doc (Section 2.2) specifies that current-task routing uses the **A rows directly**:
-  $$\text{fit}_\text{cur}(h) = \frac{\sum_{i=1}^{r} (a_i^\top h)^2}{r \cdot \|h\|^2}$$
-- The V1 code uses **SVD(B@A)** for the current task. But B=0 at initialization → the SVD returns S=0 → fit≈0 → routing weight≈0 → gradient≈0 → B cannot learn (dead loop)
-- The A rows (Kaiming init + null-space projection) are always non-zero → fit_cur > 0 from the start
-**Bug 2: Missing training bias**
-- Even with A rows, fit_cur is still systematically lower than the old tasks' fits (SVD-weighted σ²)
-- Old fit ∈ [0,1] (Rayleigh quotient), A-based fit ≤ 1/3 (because A is normalized)
-- The current task receives ~10-12% routing weight at task 8+ → weak gradient
-- Solution: a training-time bias β=1.0 added to fit_cur ONLY during training. Inference uses SVD signatures as usual
-**Bug 3: Unfair batch size**
-- V1: BSZ=64, GA=1, effective=64
-- ROOT: BSZ=8, GA=4, effective=32
-- SpecRoute used twice ROOT's effective BSZ → unfair comparison
-**Bug 4: GPM saturation (threshold=0.995)**
-- After 7 tasks, the null-space has shrunk severely
-- New sentiment tasks (imdb, sst2) are forced into directions orthogonal to yelp/amazon → they cannot be learned
-- Fix: threshold 0.995→0.980 (already in V1 analysis)
-### Experimental setup
-- **Model**: T5-Small (d_model=512)
-- **Method**: SpecRoute V2 with A-row routing + training bias + lower threshold
-- **Hyperparameters**:
-  - lora_r=8, lora_alpha=32, lr=3e-4, 10 epochs
-  - **threshold=0.980** (lowered from 0.995)
-  - **training_bias=1.0** (additive bias on the current task's fit during training)
-  - **data_replay_freq=-1** (NO replay, same as ROOT)
-  - BSZ=8, GA=4 on A100 (effective=32, same as ROOT)
-  - BSZ=4, GA=8 on T4-1gpu; BSZ=2, GA=8 on T4-2gpu
-- **Script**: `T5_small/gen_script_long_order3_t5_small_specroute_v2.sh`
-### Code Changes (Actual)
-**1. Routing Fix: `t5_specroute.py`**
-- Current task: replace SVD(B@A) with the A-row projection (matches IDEA doc Sec 2.2)
-  ```python
-  # fit_cur(h) = Σ(a_i·h)² / (r·||h||²) — uses A rows directly
-  proj = torch.matmul(A.data, h_flat.T)  # (r, N)
-  fit = (proj ** 2).sum(dim=0) / (r * h_norm_sq)  # (N,)
-  ```
-- Training bias: `current_fit = current_fit + self.training_bias` (only when `model.training`)
-- Old tasks: keep the SVD-based σ-weighted Rayleigh quotient unchanged
-- Inference: all tasks use SVD signatures (the current task gets its SVD after training)
-**2. Replay Removal: `cl_trainer_specroute.py`**
-- Remove the `create_memory_replay_generators()` function
-- Remove the replay parameters from `__init__` (data_collator_replay, replay_dataset_dict)
-- Remove the replay block from `training_step()`, keeping only the CE loss + gradient diagnostic
-- Training step: standard CE → backward → gradient check → return loss
-**3. Run entry: `run_t5.py`**
-- Add `training_bias` to ModelArguments (default=1.0)
-- Pass `training_bias` through the `prompt_config` dict
-- Remove the SpecRoute-specific replay loading condition
-- Remove `data_collator_replay`, `replay_dataset_dict` from the SpecRoute_Trainer call
-**4. Shell Script: `T5_small/gen_script_long_order3_t5_small_specroute_v2.sh`**
-- data_replay_freq: 5 → **-1** (disabled, match ROOT)
-- kl_ratio: removed, replaced with **training_bias=1.0**
-- BSZ/GA: match ROOT exactly (A100: 8/4, T4-1gpu: 4/8, T4-2gpu: 2/8)
-- threshold/transthreshold: 0.980 (kept from previous)
-### Results
-| # | Task | ROOT EM | V2 EM | Δ | Notes |
-|---|------|---------|-------|---|---------|
-| 1 | yelp | 56.01 | 35.91 | -20.10 | Below |
-| 2 | amazon | 52.05 | 36.58 | -15.47 | Below |
-| 3 | mnli | 34.07 | 0.25 | -33.82 | Catastrophic forgetting (peak 31.25 ep8) |
-| 4 | cb | 3.57 | 0.00 | -3.57 | EM=0 — misrouted garbage output |
-| 5 | copa | 42.00 | **47.00** | **+5.00** | ✅ Better |
-| 6 | qqp | 76.96 | **77.03** | **+0.07** | ✅ Tie |
-| 7 | rte | 45.85 | 0.36 | -45.49 | Catastrophic forgetting (peak 51.26 ep4) |
-| 8 | imdb | 89.51 | 0.00 | -89.51 | ❌ EM=0 — pred "positive"/"negative" vs label "Good"/"Bad" |
-| 9 | sst2 | 85.21 | 0.00 | -85.21 | ❌ EM=0 — pred "negative" vs label "Bad" |
-| 10 | dbpedia | 98.16 | 71.95 | -26.21 | Below |
-| 11 | agnews | 88.37 | 68.21 | -20.16 | Below |
-| 12 | yahoo | 57.28 | 6.82 | -50.46 | Very low |
-| 13 | multirc | 50.52 | **55.42** | **+4.90** | ✅ Better |
-| 14 | boolq | 60.43 | **61.44** | **+1.01** | ✅ Better |
-| 15 | wic | 55.49 | 0.00 | -55.49 | ❌ EM=0 — pred "the same meaning" vs label "True" |
-| | **AP(EM)** | **59.70** | **30.73** | **-28.97** | |
-| | **AP(rougeL)** | **61.66** | **38.00** | **-23.66** | |
-### Detailed analysis
-**Group 1: EM=0 due to MISROUTING (4 tasks)**
-- imdb predicts "positive"/"negative", which is the label vocabulary of yelp/amazon → routing sends imdb inputs to an old LoRA
-- sst2 predicts "negative" → same pattern, routed to the yelp LoRA
-- wic predicts "the same meaning"/"different" while the correct labels are "True"/"False" → routed to the wrong expert
-- cb predicts gibberish ("bedroom", "virtuous") → completely misrouted
-**Group 2: Catastrophic forgetting (2 tasks)**
-- mnli: EM peaks at 31.25 at ep8, but final=0.25 → degenerate (always "neutral")
-- rte: EM peaks at 51.26 at ep4, final=0.36 → overfits, then collapses
-**ROOT CAUSE: a constant β=1.0 does not scale with the number of tasks**
-| n_tasks | Training w_cur | Inference w_cur | Gap |
-|---------|---------------|----------------|-----|
-| 1 | 100% | 100% | 1.0x |
-| 2 | 71.5% | 48.0% | 1.5x |
-| 8 (imdb) | **26.4%** | **11.7%** | 2.3x |
-| 15 (wic) | **15.2%** | **6.2%** | 2.5x |
-Task 8 (imdb) receives only 26.4% of the routing weight during training → 73.6% of the gradient flows through old LoRAs → the model learns the label vocabulary of old tasks instead of the current task.
-**SECONDARY CAUSE: A-row fit vs SVD fit asymmetry**
-- Training: the current task uses A-row fit (uniform weighting)
-- Inference: the current task STILL uses A-row fit (no bias), but old tasks use SVD fit (σ²-weighted)
-- SVD fit is systematically higher than A-row fit → old tasks always win routing at inference
-### Expectations
-- Cold-start fix → resolves EM=0 on tasks 1-3 ✅
-- Training bias β=1.0 → sufficient only for ≤3 tasks, NOT sufficient for 8+ tasks ❌
----
-## Version 3.0 — SpecRoute V3: Adaptive Bias + Symmetric Inference Routing
-### Methodology changes (SPECROUTE_IDEA.md updated)
-**1. Adaptive Training Bias (replaces constant β=1.0):**
-$$\beta(n) = \tau \cdot \ln\!\left(\frac{\alpha_{\mathrm{target}} \cdot n}{1 - \alpha_{\mathrm{target}}}\right)$$
-- $n$ = number of old tasks = `len(spectral_signatures)`
-- $\alpha_{\mathrm{target}}$ = target routing weight (default 0.8)
-- Ensures w_cur ≈ 80% regardless of the total number of tasks
-- Derivation by solving the softmax equation: see SPECROUTE_IDEA.md Section C2
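The bias formula above can be checked numerically. The following is a minimal sketch, assuming the bias is added to the current task's raw score before the division by τ and that all tasks have equal raw scores (the idealized case the formula solves for); `adaptive_bias` and `current_task_weight` are hypothetical helper names, not functions from `t5_specroute.py`.

```python
import math

def adaptive_bias(n_old: int, alpha_target: float = 0.8, tau: float = 1.0) -> float:
    """beta(n) = tau * ln(alpha_target * n / (1 - alpha_target))."""
    if n_old < 1:
        return 0.0  # no old tasks: the current task already gets weight 1
    return tau * math.log(alpha_target * n_old / (1.0 - alpha_target))

def current_task_weight(score_cur, old_scores, beta, tau=1.0):
    """Softmax weight of the current task when beta is added to its raw score."""
    logits = [(score_cur + beta) / tau] + [s / tau for s in old_scores]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return exps[0] / sum(exps)

if __name__ == "__main__":
    for n_old in (1, 7, 14):
        beta = adaptive_bias(n_old)
        w = current_task_weight(0.3, [0.3] * n_old, beta)
        print(f"n_old={n_old:2d}  beta={beta:.3f}  w_cur={w:.3f}")
```

With unequal raw scores the weight drifts away from the target, which is consistent with the gap between training and inference weights reported for V2.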
-**2. Symmetric Inference Routing (replaces A-row fit at inference):**
-- After training, B≠0 → SVD(B@A) yields meaningful signatures
-- Call `prepare_inference_routing()` before prediction
-- Inference: ALL tasks (including the current one) use the same σ²-weighted Rayleigh quotient
-- Completely removes the A-row vs SVD asymmetry → measurement symmetry
-**3. Threshold 0.995 (matches ROOT, instead of 0.980):**
-- Preserves null-space capacity for later tasks
-- Capacity: d/(r·(1-ε)) = 512/(8·0.005) = 12,800 tasks (ample headroom)
-### Code Changes
-**`t5_specroute.py`:**
-- `compute_spectral_routing()`:
-  - Training: A-row fit + β(n), derived automatically from len(spectral_signatures)
-  - Inference: uses `_current_task_svd` (SVD-based fit) for the current task
-- Add `prepare_inference_routing()`: computes SVD(B@A) for the current task's LoRA
-- Add the `_target_routing_alpha` config parameter
-- Remove the fixed `training_bias`
-**`run_t5.py`:**
-- Add the `target_routing_alpha` argument (default 0.8)
-- Call `model.encoder.prepare_inference_routing()` before inference
-**Shell script: `T5_small/gen_script_long_order3_t5_small_specroute_v3.sh`:**
-- `--target_routing_alpha 0.8` (replaces `--training_bias 1.0`)
-- `--threshold 0.995` (was 0.980)
-- `--transthreshold 0.995` (was 0.980)
-### Expectations
-- Adaptive bias → tasks 8+ receive ≈80% routing weight → can learn the correct label vocabulary
-- Symmetric inference → more accurate routing at eval → EM>0 for imdb/sst2/wic
-- Threshold 0.995 → better protection + larger routing margin (Theorem 1)
-### Results (Long Order 3, T5-Small, 10 epochs/task, threshold=0.995, α=0.8)
-| # | Task | ROOT EM | V3 EM (Final) | Δ EM | V3 rougeL | ROOT rougeL |
-|---|------|---------|---------------|------|-----------|-------------|
-| 1 | yelp | 56.01 | 35.96 | -20.05 | 62.36 | — |
-| 2 | amazon | 52.05 | 36.63 | -15.42 | 61.98 | — |
-| 3 | mnli | 34.07 | 0.07 | -34.00 | 0.07 | — |
-| 4 | cb | 3.57 | 0.00 | -3.57 | 0.00 | — |
-| 5 | copa | 42.00 | 46.00 | **+4.00** | 46.00 | — |
-| 6 | qqp | 76.96 | 76.96 | **+0.00** | 76.96 | — |
-| 7 | rte | 45.85 | 0.00 | -45.85 | 14.80 | — |
-| 8 | imdb | 89.51 | 0.00 | -89.51 | 0.02 | — |
-| 9 | sst2 | 85.21 | 0.00 | -85.21 | 0.00 | — |
-| 10 | dbpedia | 98.16 | 48.83 | -49.33 | 57.60 | — |
-| 11 | agnews | 88.37 | 53.70 | -34.67 | 59.83 | — |
-| 12 | yahoo | 57.28 | 1.34 | -55.94 | 3.09 | — |
-| 13 | multirc | 50.52 | 53.73 | **+3.21** | 53.73 | — |
-| 14 | boolq | 60.43 | 61.65 | **+1.22** | 61.65 | — |
-| 15 | wic | 55.49 | 0.00 | -55.49 | 0.00 | — |
-| | **AP** | **59.70** | **33.77** | **-25.93** | **41.17** | **61.66** |
-> Results were produced on a Kaggle T4; log at `logs/t5_small_improve/log_script_long_order3_t5_small_specroute_v3`
-### Detailed analysis of V3
-**Improvement over V2** (V2: AP=30.73 → V3: AP=33.77, +3.04 pts):
-- rte: EM=0.36→0 (worse; inference routing is still wrong)
-- dbpedia: 71.95→48.83 (drop, a regression!), agnews: 68.21→53.70 (regression)
-- copa: 47.00→46.00 (comparable), multirc: 55.42→53.73 (comparable)
-- yelp: 35.91→35.96, amazon: 36.58→36.63 (unchanged)
-- The improvement comes mainly from threshold 0.995 protecting the null-space better
-**⚠️ REGRESSION vs V2**: dbpedia and agnews drop significantly → symmetric SVD inference is WORSE than A-row inference for multi-class tasks!
-**3 groups of remaining problems**:
-| Group | Tasks | Symptom | Root cause |
-|------|-------|-----------|------------|
-| 1. Cannot learn (training failure) | cb, imdb, sst2, wic, yahoo | high train_loss (>1.34), eval_em=0 from the start | GPM null-space saturation → A_k ⊥ sentiment subspace |
-| 2. Catastrophic forgetting | yelp, amazon, rte | high peak EM (54%, 48%, 21%), lower final EM | Routing accuracy degrades as tasks accumulate |
-| 3. Mode collapse | mnli | EM stuck ≈31% (= 1/3 priors) | Model collapses to the "neutral" mode |
-**Root cause: GPM-induced Routing Ambiguity (Theorem)**:
-Let $A_k \in \mathbb{R}^{r \times d}$ be the LoRA-A of task $k$, whose rows are GPM-projected to be orthogonal to $\{A_1, ..., A_{k-1}\}$. For a task $k' > k$ with a similar input distribution ($P(h|k') \approx P(h|k)$), the spectral score satisfies:
-$$\text{score}(h; A_{k'}) = \frac{1}{r}\sum_{i=1}^r \frac{(a_i^{(k')} \cdot h)^2}{\|h\|^2} \leq \text{score}(h; A_k)$$
-*this inequality holds for all $h \sim P(h|k')$* because $A_{k'}$ is forced orthogonal to the dominant subspace of $k$, which is also the dominant subspace of $k'$.
-**Direct consequence**: imdb inputs score higher against the yelp LoRA than against the imdb LoRA → misrouted.
 ---
-## Version 2.1 — Performance Optimization (Thin QR+SVD)
-### Problem
-SpecRoute V1 runs a full SVD(512×512) per forward pass even though rank(B@A)≤8. This wastes compute.
-### Optimization: Thin QR+SVD (ZERO accuracy loss)
-**Applies to**: `compute_spectral_signatures()` (offline, after training).
-**Does NOT apply to**: current task routing (V2 uses A rows → no SVD needed).
-**Mathematical principle**: since rank(B@A) ≤ r = 8, decompose via 2 small QRs + 1 SVD of size 8×8:
-1. QR(B) → Q_B(512×8), R_B(8×8) — cost O(m·r²)
-2. QR(A^T) → Q_A(512×8), R_A(8×8) — cost O(n·r²)
-3. SVD(R_B @ R_A^T) → U_s, S, Vh_s — cost O(r³) = O(512) operations
-4. Vt_full = Vh_s @ Q_A^T — cost O(n·r²)
-**Mathematically IDENTICAL** to the full SVD, not an approximation.
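The four steps above can be sketched as follows. This is a minimal NumPy illustration of the factorization, not the repository's `_thin_svd_low_rank()` (which is PyTorch-based and takes a `device` argument); the function name `thin_svd_low_rank` and the test shapes are assumptions for the example.

```python
import numpy as np

def thin_svd_low_rank(B: np.ndarray, A: np.ndarray):
    """SVD of B @ A (B: m x r, A: r x n) without forming the m x n product.

    B @ A = Q_B R_B (Q_A R_A)^T = Q_B (R_B R_A^T) Q_A^T, so only an r x r SVD is needed.
    """
    Q_B, R_B = np.linalg.qr(B)                  # Q_B: m x r, R_B: r x r
    Q_A, R_A = np.linalg.qr(A.T)                # Q_A: n x r, R_A: r x r
    U_s, S, Vh_s = np.linalg.svd(R_B @ R_A.T)   # r x r core SVD
    U = Q_B @ U_s                               # m x r left singular vectors
    Vt = Vh_s @ Q_A.T                           # r x n right singular vectors
    return U, S, Vt

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B = rng.standard_normal((512, 8))
    A = rng.standard_normal((8, 512))
    U, S, Vt = thin_svd_low_rank(B, A)
    full = np.linalg.svd(B @ A, compute_uv=False)[:8]
    print("max singular-value difference:", np.abs(S - full).max())
```

The result is exact up to floating-point rounding because the orthonormal factors Q_B and Q_A do not change the singular values of the r×r core.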
-**Benchmark (CPU, 512×512 matrix, r=8)**:
-- Full SVD: 12.55 ms/call → 150.6 ms per forward (12 calls)
-- Thin QR+SVD: 0.067 ms/call → 0.8 ms per forward
-- **Speedup: 186×**
-- Relative error: ~1e-6 (machine precision)
-### Code Changes
-**`t5_specroute.py`**:
-- Add the function `_thin_svd_low_rank(B, A, device)`: QR decomposition + 8×8 SVD + recovery of the full factors
-- `compute_spectral_routing()`: replace `torch.linalg.svd(B@A, ...)` with `_thin_svd_low_rank(B, A)`
-- `compute_spectral_signatures()`: same change
-### Impact
-| Component | Before | After |
-|-----------|-------|-----|
-| SVD per signature compute | ~12.55ms | ~0.067ms |
-| Speedup | — | **186×** |
-| Accuracy loss | — | 0 (exact, error ~1e-6) |
-> V2 no longer runs an SVD per forward pass for the current task (it uses the A rows directly).
-> Thin QR+SVD is used only by `compute_spectral_signatures()`, after training on each task completes.
-### Recommendation
-V2 has replay disabled (`data_replay_freq=-1`), matching ROOT. Estimated runtime is on par with ROOT (~4-5h on a T4).
 ---
-## Version 4.0 — SpecRoute V4: Spectrally-Conditioned LoRA Training (C4)
-### Motivation
-V3 addresses routing and protection, but single-task LoRA quality remains limited by:
-1. **Gradient distortion**: Frozen A (after InfLoRA null-space projection) has non-orthonormal rows, so $AA^T \neq I$ → B gradients are distorted
-2. **Low effective rank**: CE loss alone doesn't encourage full utilization of LoRA's rank-r budget
-### Methodology (C4)
-Two complementary components:
-**C4.1 Preconditioned Gradient**: Apply $(AA^T + \epsilon I)^{-1/2}$ to B's gradient after backward, equalizing gradient magnitudes across all rank directions. Computed ONCE after `get_reg_matrix()` (A is frozen → constant preconditioner).
-**C4.2 Spectral Entropy Regularization**: $\mathcal{L} = \mathcal{L}_{CE} + \lambda \sum_\ell (\log r - H_\ell)$ where $H_\ell$ is spectral entropy of the $\ell$-th LoRA layer. Efficient QR trick: $O(r^3)$ instead of full SVD.
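The entropy term in C4.2 can be illustrated with a short NumPy sketch. The source does not state how the singular values are normalized into a distribution, so the choice $p_i = \sigma_i / \sum_j \sigma_j$ below is an assumption, as is the helper name `spectral_entropy_penalty`; the QR orientation follows `qr(B)` and `qr(A.T)`.

```python
import numpy as np

def spectral_entropy_penalty(B: np.ndarray, A: np.ndarray, eps: float = 1e-12) -> float:
    """Return log(r) - H for one LoRA layer, with H the entropy of normalized singular values.

    B: (d_out, r), A: (r, d_in). Singular values of B @ A come from an r x r core.
    Assumption: p_i = sigma_i / sum(sigma); the source does not specify the normalization.
    """
    R_B = np.linalg.qr(B, mode="r")        # r x r upper-triangular factor of B
    R_A = np.linalg.qr(A.T, mode="r")      # r x r upper-triangular factor of A^T
    sigma = np.linalg.svd(R_B @ R_A.T, compute_uv=False)
    p = sigma / (sigma.sum() + eps)
    H = -np.sum(p * np.log(p + eps))       # spectral entropy, at most log(r)
    return float(np.log(len(sigma)) - H)   # 0 when all singular values are equal

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B = rng.standard_normal((512, 8))
    A = rng.standard_normal((8, 512))
    print("penalty for a random layer:", spectral_entropy_penalty(B, A))
```

The penalty is zero when the update spreads evenly over all r directions and approaches log r when it collapses to rank 1, which is the behavior the regularizer is meant to discourage.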
 ### Hyperparameters
-| Parameter | Value | Role |
-|-----------|-------|------|
-| `lambda_entropy` | 0.01 | Weight of spectral entropy loss |
-| `use_preconditioning` | True | Enable gradient preconditioning |
-| `precond_eps` | 1e-6 | Numerical stability |
-| `entropy_warmup_ratio` | 0.1 | 10% warmup before enabling entropy loss |
-### Code Changes
-1. **`cl_trainer_specroute.py`**:
-   - Added C4 params to `__init__` (lambda_entropy, use_preconditioning, precond_eps, entropy_warmup_ratio)
-   - Added `precompute_preconditioners()`: eigendecomposition of AA^T → $(AA^T+\epsilon I)^{-1/2}$
-   - Added `_compute_spectral_entropy_loss()`: QR trick → SVD of r×r matrix → entropy
-   - Added `_apply_preconditioning()`: post-backward gradient modification
-   - Modified `training_step()`: entropy loss + preconditioning
-2. **`run_t5.py`**:
-   - 4 new args: lambda_entropy, use_preconditioning, precond_eps, entropy_warmup_ratio
-   - Pass to SpecRoute_Trainer constructor
-   - Call `precompute_preconditioners()` after `get_reg_matrix()`
-3. **V4 shell script**: `gen_script_long_order3_t5_small_specroute_v4.sh`
-### Ablation Plan
-| Experiment | Precond | Entropy | Purpose |
-|------------|---------|---------|---------|
-| V3 (baseline) | ✗ | ✗ | Current best |
-| V4a | ✓ | ✗ | Isolate preconditioning |
-| V4b | ✗ | ✓ | Isolate entropy |
-| V4 (full) | ✓ | ✓ | Full C4 |
-### Expectations
-- Preconditioning: faster convergence, especially for tasks where A has high condition number
-- Entropy: higher effective rank → richer LoRA representations → better generalization
-- Combined: both effects are orthogonal and additive
-- Risk: entropy regularization may hurt tasks that genuinely need low-rank updates (mitigated by warmup + modest λ)
-### Bug Fixes (before running)
-3 bugs were found and fixed in `cl_trainer_specroute.py`:
-| # | Bug | Fix |
-|---|-----|-----|
-| 1 | `A.T @ A` → shape (512,512) instead of the (8,8) preconditioner | Changed to `A @ A.T` |
-| 2 | Cross-attention layers have d_out≠d_in → `assert` crash | Replaced `assert` with `continue` |
-| 3 | `nan_to_num_` guard was removed → NaN gradients | Restored after preconditioning |
-> All 3 bugs concern `precompute_preconditioners()` / `_apply_preconditioning()`. The V4 log (`log_script_long_order3_t5_small_specroute_v4`) is **0 bytes**: the experiment crashed before producing output because of bug #1.
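Bug #1 is a shape issue that a small sketch makes concrete. The code below is a NumPy illustration of the inverse square root via eigendecomposition, assuming A has shape (r, d) = (8, 512); `inv_sqrt_preconditioner` is a hypothetical name and does not reproduce `precompute_preconditioners()`.

```python
import numpy as np

def inv_sqrt_preconditioner(A: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Return (A A^T + eps I)^(-1/2) for a frozen LoRA-A of shape (r, d)."""
    r = A.shape[0]
    gram = A @ A.T + eps * np.eye(r)       # (r, r); A.T @ A would be (d, d)
    evals, evecs = np.linalg.eigh(gram)    # symmetric positive definite → real eigenpairs
    # evecs @ diag(evals^-1/2) @ evecs.T, written with broadcasting over columns
    return (evecs / np.sqrt(evals)) @ evecs.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 512))
    P = inv_sqrt_preconditioner(A)
    print("preconditioner shape:", P.shape)           # (8, 8)
    print("wrong Gram matrix shape:", (A.T @ A).shape)  # (512, 512)
```

Because A is frozen, this matrix only needs to be computed once per task, which matches the note that the preconditioner is computed after `get_reg_matrix()`.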
-### Results
-*(Not run yet; needs a rerun after the bug fixes)*
----
-## Version 5.0 — SpecRoute V5: Prototype Routing (Resolving GPM-induced Routing Ambiguity)
-### Motivation: Analyzing the Orthogonality Condition
-**Question**: Should the orthogonality constraint (GPM null-space projection) be relaxed?
-**Analysis:**
-ROOT GainLoRA uses the **same** GPM orthogonality on LoRA-A (InfLoRA) and reaches AP=59.70. So orthogonality is not the bottleneck; routing is the problem.
-GPM orthogonality serves 2 purposes:
-1. **Protection**: Ensures $\nabla_{B_k}$ does not interfere with the old $B_j A_j$ → **NECESSARY**
-2. **Routing signal** (in V1-V3): Spectral fit uses the LoRA subspace → **HARMFUL** when tasks share a domain
-> **Conclusion**: Keep strict orthogonality for protection (same backbone as ROOT). Decouple routing from the LoRA subspace via prototype routing.
-**On "space exhaustion"**: With d=512, r=8, ε=0.995: capacity = d/(r·(1-ε)) = 12,800 tasks. The space is not exhausted in quantity. The real problem: *the important directions* (sentiment subspace) are captured by the first task → later tasks cannot use those directions for LoRA-A → representation is constrained. But ROOT is constrained in the same way and still reaches AP=59.70 thanks to good routing.
-### Theory: GPM-Routing Paradox
-> **GPM-Routing Paradox**: GPM forces $A_k \perp A_{k'}$ for $k' > k$. For same-domain tasks, $A_{k'}$ is pushed away from the dominant input subspace. Spectral routing measures alignment with the LoRA subspace → misroute.
-$$P(h|k') \approx P(h|k) \implies \alpha_{k'}(h) \ll \alpha_k(h) \quad \forall h \sim P(h|k')$$
-**Solution: Prototype routing in input embedding space** (decoupled from GPM):
-$$w(h) = \text{softmax}\!\left(\frac{[\cos(h, \mu_1), \ldots, \cos(h, \mu_T)]}{\tau}\right)$$
-$\mu_k$ = normalized running mean of attention-masked input embeddings during task $k$ training.
-**Theoretical foundation (LDA, Linear Discriminant Analysis)**:
-Under a Gaussian mixture $P(h|k) = \mathcal{N}(\mu_k, \Sigma)$ with shared covariance, the Bayes-optimal rule is nearest centroid under the Mahalanobis distance (plain Euclidean nearest centroid when $\Sigma$ is isotropic and priors are equal). Cosine similarity against normalized centroids is equivalent to nearest centroid for unit-norm data.
-Prototype routing:
-- **GPM-immune**: μ_k lives in embedding space and is not GPM-projected
-- **Zero-replay**: needs only a running mean (d scalars per task)
-- **Drift-free**: frozen embedding table → μ_k stationary (Proposition 1)
-- **Same-domain discriminable**: μ_yelp ≠ μ_imdb because the vocabulary differs (restaurant vs movie)
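The routing rule above can be sketched end to end. This is a minimal NumPy illustration under stated assumptions: the class `PrototypeRouter` and its method names are hypothetical, embeddings are pooled with an attention-masked mean, and prototypes are unit-normalized running means; it is not the `T5Stack` implementation.

```python
import numpy as np

class PrototypeRouter:
    """Cosine-similarity routing over per-task mean embeddings (illustrative sketch)."""

    def __init__(self, tau: float = 0.01):
        self.tau = tau
        self.prototypes = []   # finalized, unit-norm mu_k, one per task
        self._sum = None       # running sum of pooled embeddings for the current task
        self._count = 0

    def update(self, emb: np.ndarray, mask: np.ndarray) -> None:
        """emb: (B, L, d) input embeddings; mask: (B, L) attention mask (1 = real token)."""
        m = mask[..., None].astype(emb.dtype)
        # masked mean: sum over real tokens divided by their count, not .mean() over padding
        pooled = (emb * m).sum(axis=1) / np.clip(m.sum(axis=1), 1.0, None)
        batch_sum = pooled.sum(axis=0)
        self._sum = batch_sum if self._sum is None else self._sum + batch_sum
        self._count += pooled.shape[0]

    def finalize(self) -> None:
        """Normalize the running mean and store it as the prototype of the finished task."""
        mu = self._sum / self._count
        self.prototypes.append(mu / np.linalg.norm(mu))
        self._sum, self._count = None, 0

    def route(self, h: np.ndarray) -> np.ndarray:
        """h: (B, d) pooled embeddings → (B, T) routing weights."""
        P = np.stack(self.prototypes)                                # (T, d)
        cos = (h @ P.T) / np.linalg.norm(h, axis=1, keepdims=True)   # (B, T)
        logits = cos / self.tau
        logits = logits - logits.max(axis=1, keepdims=True)          # stable softmax
        w = np.exp(logits)
        return w / w.sum(axis=1, keepdims=True)
```

Storage is one d-dimensional vector per task, and nothing in the router touches the LoRA matrices, which is why the GPM projection cannot affect it.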
-### Code Changes (IMPLEMENTED)
-**`t5_specroute.py`:**
-- `T5Stack.__init__()`: add `task_prototypes`, `_current_prototype_sum/count`, `_current_task_prototype`
-- `_update_prototype(h_batch)`: accumulate the running mean (called automatically in `forward()`)
-- `finalize_prototype()`: normalize the prototype once training finishes
-- `compute_spectral_routing()`:
-  - Training: A-row fit + adaptive β (UNCHANGED)
-  - Inference: prototype cosine similarity (NEW) when enough prototypes exist, otherwise fall back to spectral routing
-- `T5Stack.forward()`:
-  - Fix masked mean: `.sum() / mask_count` instead of `.mean()` (avoids dilution by padding)
-  - Automatically calls `_update_prototype()` in every training forward pass (including task 1)
-**`run_t5.py`:**
-- Load task prototypes together with the spectral signatures
-- Call `finalize_prototype()` + save `task_prototype.pt` after training
-**`SPECROUTE_IDEA.md`:**
-- Add C2.1 Prototype Routing (inference-time, V5)
-- Update the Code-Idea Alignment table
-**`gen_script_long_order3_t5_small_specroute_v5.sh`:**
-- Copied from V4 (keeps the C4 params: preconditioning + entropy)
-- Prototype routing activates automatically when prototype files exist; NO new flag needed
-### Architecture comparison
-| Component | ROOT | V3 (Spectral) | V5 (Prototype) |
-|-----------|------|---------------|----------------|
-| Training routing | Learned MLP | A-row + adaptive β | A-row + adaptive β |
-| Inference routing | Learned MLP | SVD spectral fit | **Prototype cosine** |
-| Routing params | trans_input + prompt_key | None | **μ_k (512 per task)** |
-| GPM on routing | ✓ (trans_input) | ✗ | ✗ |
-| Same-domain | ✓ (learned) | ✗ (paradox) | **✓ (vocabulary diff)** |
-| Protection | GPM LoRA-A | GPM LoRA-A | GPM LoRA-A |
-| Single-task quality | Standard CE | Standard CE | **C4 (precond + entropy)** |
-### Expectations
-- **imdb, sst2, wic**: EM > 0 → prototype discriminates vocabulary distributions
-- **yelp, amazon**: EM regained (less forgetting) → correct routing
-- **mnli**: less mode collapse (depends on prototype quality for NLI)
-- **AP(EM) target**: > 45 (prototype fixes routing + C4 improves quality)
-### Results
-*(Not run yet; requires running `gen_script_long_order3_t5_small_specroute_v5.sh`)*
 ---
 ## Changelog
-| Date | Version | Change Type | Description |
-|------|---------|-------------|-------------|
-| 2025-XX-XX | V1.0 | Initial | First experiment — baseline SpecRoute vs ROOT GainLoRA |
-| 2025-XX-XX | V2.0 (cancelled) | ~~Replay~~ | ~~Add experience replay~~; **CANCELLED** because it violates the zero-replay constraint |
-| 2025-XX-XX | V2.0 | Bug fix + Fair | A-row routing (fix cold-start), training bias β=1.0, threshold 0.980, fair BSZ=32 |
-| 2025-XX-XX | V2.1 | Perf Optimization | Thin QR+SVD (~186× speedup per SVD, zero accuracy loss) |
-| 2026-03-17 | V2.0 | **Results** | AP(EM)=30.73 vs ROOT=59.70. 4 tasks EM=0 (imdb/sst2/wic/cb misrouting), 2 catastrophic forgetting |
-| 2026-03-17 | V3.0 | **Methodology** | Adaptive bias β(n)=τ·ln(α·n/(1-α)), symmetric SVD inference routing, threshold→0.995 |
-| 2026-03-17 | V4.0 | **C4 Implementation** | Preconditioned gradient + spectral entropy regularization for single-task LoRA quality |
-| 2026-03-19 | V3.0 | **Results** | AP(EM)=33.77, AP(rougeL)=41.17. imdb/sst2/wic/cb/rte/mnli still fail; GPM routing ambiguity confirmed |
-| 2026-03-19 | V4.0 | **Bug Fix** | 3 bugs fixed: A@A.T, assert→continue cross-attention, nan_to_num_ guard. Log was 0 bytes (crash) |
-| 2026-03-19 | V5.0 | **Proposal** | Prototype routing — replace spectral SVD fit with cosine distance to task mean embeddings |
-| 2026-03-19 | V5.0 | **Implementation** | Prototype routing (3 bugs fixed: init scope, cosine shape, eval prototype). Spectral entropy QR bug fixed (pre-existing). Separate prototype_temperature=0.01. Diagnostic logging. |

+# Experiment Versions — SpecRoute (T5-small, Long Order 3)
+**Backbone**: google/flan-t5-small (d_model=512, 8 enc+8 dec layers)
+**LoRA**: r=8, target=Q+V, InfLoRA (only B trained, A frozen kaiming)
+**Task order (15)**: yelp → amazon → mnli → cb → copa → qqp → rte → imdb → sst2 → dbpedia → agnews → yahoo → multirc → boolq → wic
+**Settings**: zero-replay, lr=3e-4, epochs=10, threshold=0.995
 ---
+## ROOT Baseline — GainLoRA + InfLoRA
+**Script**: `gen_script_long_order3_t5_small_gainlora_inflora.sh`
+**Method**: Learned MLP routing (trans_input + prompt_key), LoRA GPM (ESA), KL distill
+### Final results (after 15 tasks)
+| Task | EM | rougeL |
+|------|---:|-------:|
+| yelp | 56.01 | 70.36 |
+| amazon | 52.05 | 67.08 |
+| mnli | 34.07 | 34.07 |
+| cb | 3.57 | 3.57 |
+| copa | 42.00 | 42.00 |
+| qqp | 76.96 | 76.96 |
+| rte | 45.85 | 45.85 |
+| imdb | 89.51 | 89.51 |
+| sst2 | 85.21 | 85.21 |
+| dbpedia | 98.16 | 98.16 |
+| agnews | 88.37 | 88.38 |
+| yahoo | 57.28 | 57.35 |
+| multirc | 50.52 | 50.52 |
+| boolq | 60.43 | 60.43 |
+| wic | 55.49 | 55.49 |
+| **AP (unweighted)** | **59.70** | **61.66** |
+| **AP (weighted)** | **67.28** | **70.44** |
 ---
+## V2 — Spectral Routing (initial SpecRoute)
+**Changes vs ROOT**: Replace MLP routing with spectral routing (SVD-based Rayleigh quotient); drop KL distill + data replay.
+**AP(EM)**: 30.73 (unweighted)
+**Main problems**: 4 tasks at EM=0 (imdb, sst2, cb, wic) due to misrouting → label format mismatch. mnli and rte suffer catastrophic forgetting.
 ---
+## V3 — Adaptive Bias + Symmetric Inference Routing
+**Changes**:
+- Adaptive training bias: β = T·ln(α·n_old/(1−α)), target_routing_alpha=0.8
+- Symmetric inference routing: prepare_inference_routing() for the current task
+- Threshold 0.995 (matches ROOT)
+**AP(EM)**: 27.66 (unweighted), a **REGRESSION** from V2
+**Cause**: The V3 code was run with the V2 SCRIPT (threshold=0.98 instead of 0.995). In addition, the train-inference routing mismatch persists.
+**Conclusion for V2-V3**: SVD routing HAS A STRUCTURAL LIMITATION: it cannot distinguish same-domain tasks (GPM forces A_k ⊥ A_j → spectral signatures collapse).
 ---
+## V5 — Prototype Routing + Preconditioning + Entropy
+**Script**: `T5_small/gen_script_long_order3_t5_small_specroute_v5.sh`
+**Logs**: `/logs/t5_small_improve/gen_script_long_order3_t5_small_specroute_v5/`
+### Motivation (GPM-Routing Paradox)
+- GPM forces A_k ⊥ A_j → spectral routing via SVD(B@A) fails for same-domain tasks
+- ROOT's MLP routing works because its learned parameters are separate from the LoRA subspace
+- **Solution**: Prototype routing at inference: cosine similarity between the encoder embedding and task prototypes (running mean during training)
+### Code changes (6 bugs fixed across 4 dev/review cycles)
+1. **Prototype routing** (t5_specroute.py): `_update_prototype()`, `finalize_prototype()`, dual-mode inference
+2. **Entropy QR fix** (cl_trainer_specroute.py): `qr(B.T)→qr(B)`, `qr(A)→qr(A.T)`
+3. **Prototype temperature** T_proto=0.01 (separate from T_train=1.0)
+4. **Preconditioning** (C4): lambda_entropy=0.01, use_preconditioning=True
+5. **Init scope fix**: Prototype fields under `if not self.is_decoder`
+6. **Cosine shape fix**: `.squeeze(-1)` → correct (B,) shape
 ### Hyperparameters
+- lambda_entropy = 0.01
+- use_preconditioning = True
+- threshold = 0.995
+- target_routing_alpha = 0.8
+- lora_r = 8
+- learning_rate = 3e-4
+- num_train_epochs = 10
+### V5 results (compared with ROOT)
+| Task | ROOT EM | V5 EM | Δ | ROOT rougeL | V5 rougeL | Δ |
+|------|--------:|------:|--:|------------:|----------:|--:|
+| yelp | 56.01 | 54.64 | -1.37 | 70.36 | 70.45 | +0.09 |
+| amazon | 52.05 | 48.01 | -4.04 | 67.08 | 66.22 | -0.86 |
+| mnli | 34.07 | 33.92 | -0.15 | 34.07 | 33.92 | -0.15 |
+| cb | 3.57 | 0.00 | -3.57 | 3.57 | 0.00 | -3.57 |
+| copa | 42.00 | 44.00 | +2.00 | 42.00 | 44.00 | +2.00 |
+| qqp | 76.96 | 77.83 | +0.87 | 76.96 | 77.83 | +0.87 |
+| rte | 45.85 | 48.01 | +2.17 | 45.85 | 52.71 | +6.86 |
+| imdb | 89.51 | 88.61 | -0.90 | 89.51 | 88.61 | -0.90 |
+| sst2 | 85.21 | 81.19 | -4.01 | 85.21 | 81.19 | -4.01 |
+| dbpedia | 98.16 | 97.67 | -0.49 | 98.16 | 97.83 | -0.33 |
+| agnews | 88.37 | 89.74 | +1.37 | 88.38 | 89.83 | +1.45 |
+| yahoo | 57.28 | 49.66 | -7.62 | 57.35 | 50.37 | -6.98 |
+| multirc | 50.52 | 60.44 | +9.92 | 50.52 | 60.44 | +9.92 |
+| boolq | 60.43 | 61.01 | +0.58 | 60.43 | 61.01 | +0.58 |
+| wic | 55.49 | 58.46 | +2.98 | 55.49 | 58.46 | +2.98 |
+| **AP (unwt)** | **59.70** | **59.55** | **-0.15** | **61.66** | **62.19** | **+0.53** |
+| **AP (wt)** | **67.28** | **66.65** | **-0.63** | **70.44** | **70.42** | **-0.02** |
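The unweighted AP rows can be reproduced directly from the per-task EM columns. The sketch below copies the EM values from the table; the weighted AP is not recomputed because the per-task weights are not given in this document.

```python
# Per-task final EM, copied from the V5 vs ROOT table.
root_em = {
    "yelp": 56.01, "amazon": 52.05, "mnli": 34.07, "cb": 3.57, "copa": 42.00,
    "qqp": 76.96, "rte": 45.85, "imdb": 89.51, "sst2": 85.21, "dbpedia": 98.16,
    "agnews": 88.37, "yahoo": 57.28, "multirc": 50.52, "boolq": 60.43, "wic": 55.49,
}
v5_em = {
    "yelp": 54.64, "amazon": 48.01, "mnli": 33.92, "cb": 0.00, "copa": 44.00,
    "qqp": 77.83, "rte": 48.01, "imdb": 88.61, "sst2": 81.19, "dbpedia": 97.67,
    "agnews": 89.74, "yahoo": 49.66, "multirc": 60.44, "boolq": 61.01, "wic": 58.46,
}

def average_performance(scores: dict) -> float:
    """Unweighted AP: plain mean of the final per-task scores."""
    return sum(scores.values()) / len(scores)

if __name__ == "__main__":
    ap_root = average_performance(root_em)
    ap_v5 = average_performance(v5_em)
    print(f"ROOT AP(EM) = {ap_root:.2f}")
    print(f"V5   AP(EM) = {ap_v5:.2f}")
    print(f"delta       = {ap_v5 - ap_root:+.2f}")
```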
+### Comparison with V3 (prototype routing fix)
+V3 had 6 failing tasks (EM≈0): imdb, sst2, wic, cb, rte, mnli. V5 results:
+- **imdb**: 0 → 88.61 ✅ FIXED
+- **sst2**: 0 → 81.19 ✅ FIXED
+- **wic**: 0 → 58.46 ✅ FIXED
+- **rte**: 0 → 48.01 ✅ FIXED
+- **mnli**: 0 → 33.92 ✅ FIXED (≈ ROOT 34.07)
+- **cb**: 0 → 0.00 ❌ STILL BROKEN
+### Forgetting Analysis (V5)
+| Task | After Training | Final (15-wic) | Forgetting |
+|------|---------------:|---------------:|-----------:|
+| yelp | 55.82 | 54.64 | -1.17 |
+| amazon | 48.84 | 48.01 | -0.83 |
+| mnli | 34.39 | 33.92 | -0.47 |
+| cb | 0.00 | 0.00 | 0.00 |
+| copa | 44.00 | 44.00 | 0.00 |
+| qqp | 77.82 | 77.83 | +0.01 |
+| rte | 52.71 | 48.01 | -4.69 |
+| imdb | 89.46 | 88.61 | -0.86 |
+| sst2 | 81.42 | 81.19 | -0.23 |
+| dbpedia | 98.18 | 97.67 | -0.51 |
+| agnews | 89.92 | 89.74 | -0.18 |
+| yahoo | 51.96 | 49.66 | -2.30 |
+| multirc | 61.90 | 60.44 | -1.46 |
+| boolq | 61.01 | 61.01 | 0.00 |
+| wic | 58.46 | 58.46 | 0.00 |
+| **Average** | | | **-0.85** |
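The forgetting column is the final EM minus the EM measured right after training on each task. A short sketch, using the rounded values from the table (so individual entries may differ from the Forgetting column by 0.01, since that column was presumably computed from unrounded scores):

```python
# EM right after training each task, and EM after the final task (wic), from the table.
after_training = {
    "yelp": 55.82, "amazon": 48.84, "mnli": 34.39, "cb": 0.00, "copa": 44.00,
    "qqp": 77.82, "rte": 52.71, "imdb": 89.46, "sst2": 81.42, "dbpedia": 98.18,
    "agnews": 89.92, "yahoo": 51.96, "multirc": 61.90, "boolq": 61.01, "wic": 58.46,
}
final = {
    "yelp": 54.64, "amazon": 48.01, "mnli": 33.92, "cb": 0.00, "copa": 44.00,
    "qqp": 77.83, "rte": 48.01, "imdb": 88.61, "sst2": 81.19, "dbpedia": 97.67,
    "agnews": 89.74, "yahoo": 49.66, "multirc": 60.44, "boolq": 61.01, "wic": 58.46,
}

# Negative values mean the task got worse after later tasks were trained.
forgetting = {task: final[task] - after_training[task] for task in after_training}
average_forgetting = sum(forgetting.values()) / len(forgetting)
most_forgotten = min(forgetting, key=forgetting.get)

if __name__ == "__main__":
    print(f"average forgetting = {average_forgetting:.2f}")
    print(f"most forgotten task = {most_forgotten} ({forgetting[most_forgotten]:.2f})")
```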
+### Training Loss per Task
+| Task | Samples | Train Loss | Note |
+|------|--------:|-----------:|------|
+| yelp | 5000 | 0.4017 | OK |
+| amazon | 5000 | 0.4064 | OK |
+| mnli | 3000 | 0.6986 | Moderate |
+| cb | 250 | **4.3962** | ❌ FAILS TO LEARN |
+| copa | 400 | 0.4071 | OK |
+| qqp | 2000 | 0.2785 | Good |
+| rte | 2000 | 0.7683 | Moderate-high |
+| imdb | 2000 | 0.0993 | Very good |
+| sst2 | 2000 | 0.3200 | Good |
+| dbpedia | 14000 | 0.0536 | Excellent |
+| agnews | 4000 | 0.1756 | Good |
+| yahoo | 10000 | 0.5820 | Moderate |
+| multirc | 2000 | 0.5153 | Moderate |
+| boolq | 2000 | 0.3958 | OK |
+| wic | 2000 | 0.4847 | Moderate |
+### CB Failure Analysis
+CB is task 4 (early in the sequence) and has only **250 samples** → 8 steps/epoch → **80 total steps**.
+- Loss curve: 5.25 → 3.63 (decreasing but not converged; eval_em=0.00 throughout)
+- ROOT also nearly fails on CB (EM=3.57): CB is an inherently difficult task with too little data
+- This is NOT a routing error; it is a single-task learning quality problem under extremely low resources
+### Overall assessment of V5
+**Major success**: Prototype routing resolves the GPM-Routing Paradox: 5 of 6 tasks go from EM≈0 to roughly ROOT level.
+**AP on par with ROOT**: V5 AP(EM)=59.55 ≈ ROOT 59.70 (-0.15). V5 AP(rougeL)=62.19 > ROOT 61.66 (+0.53).
+**Very low forgetting**: Average forgetting is only -0.85 (excellent for a 15-task sequence).
+**Remaining problems**:
+1. **CB = 0.00**: Extreme low-resource task (250 samples). ROOT also nearly fails (3.57). Needs a dedicated strategy for tiny datasets.
+2. **yahoo -7.62 vs ROOT**: V5 yahoo after_training=51.96, ROOT yahoo=57.28. This is a single-task quality issue, not forgetting.
+3. **sst2 -4.01 vs ROOT**: V5=81.19 vs ROOT=85.21. Likewise a single-task gap.
+4. **amazon -4.04 vs ROOT**: V5=48.01 vs ROOT=52.05.
+5. **rte forgetting -4.69**: Highest of all tasks. rte reached 52.71 after training but decayed → 48.01.
 ---
 ## Changelog
+| Date | Version | AP(EM) unwt | AP(rougeL) unwt | Key Change |
+|------|---------|------------:|----------------:|------------|
+| - | ROOT (baseline) | 59.70 | 61.66 | GainLoRA + InfLoRA |
+| - | V2 | ~30.73 | - | Spectral routing (SVD) |
+| - | V3 | ~27.66 | - | Adaptive bias + symmetric (WRONG script) |
+| - | V5 | **59.55** | **62.19** | Prototype routing + entropy + preconditioning |

results/review_1.md DELETED Viewed

@@ -1,154 +0,0 @@
-# Review Round 1 — V5 Prototype Routing
-**Reviewer role**: Objective reviewer (code + theory + methodology)
-**Scope**: V5 implementation (t5_specroute.py, run_t5.py), SPECROUTE_IDEA.md theory, experiment design
----
-## 1. Code Correctness
-### 1.1 Bugs Found and Fixed (DEV Round 1)
-| # | Severity | Description | Status |
-|---|----------|-------------|--------|
-| B1 | **Critical** | Prototype fields (`_current_prototype_sum`, etc.) only initialized under `if not run_single:`. First task (run_single=True) crashes with `AttributeError` when `_update_prototype()` is called in `forward()`. | ✅ Fixed — moved to `if not self.is_decoder:` block |
-| B2 | **Critical** | `.squeeze(-1)` in cosine sim changes shape (B,1)→(B,), then division by h_norm (B,1) broadcasts to **(B,B)** instead of (B,1). Routing weights would be completely wrong. | ✅ Fixed — removed `.squeeze(-1)` |
-| B3 | **Important** | During training eval, `_current_task_prototype` is None (not finalized) → prototype routing falls back to spectral routing → mid-training eval metrics don't reflect prototype quality. | ✅ Fixed — use running mean as temporary prototype |
-### 1.2 Remaining Issues
-| # | Severity | Description | Recommendation |
-|---|----------|-------------|----------------|
-| R1 | **Medium** | `_compute_spectral_entropy_loss()` uses `qr(B.T)` and `qr(A)` — mathematically incorrect QR trick. Should be `qr(B)` and `qr(A.T)` to match `_thin_svd_low_rank()`. Current code computes an *approximation* of true singular values (off by orthogonal rotation factor $W = Q_{B'}^T Q_{A'}$). | Fix: change `qr(B.T) → qr(B)`, `qr(A) → qr(A.T)` in entropy loss. Low priority since regularization is approximate by nature. |
-| R2 | **Low** | Prototype accumulation moves data to CPU every forward pass (`.cpu()` in `_update_prototype`). Minimal overhead (~16KB per call) but adds GPU→CPU transfer per batch. | Acceptable. Alternative: accumulate on GPU, move once at finalize. Not worth optimizing unless profiling shows bottleneck. |
-| R3 | **Low** | No direct test for prototype routing quality before running full 15-task experiment. | Consider: quick sanity check — after training 2 tasks, verify cos(h_task1, μ_task1) > cos(h_task1, μ_task2) on a few samples. |
-### 1.3 Shape Analysis (Verified ✓)
-```
-_update_prototype:
-  h_batch: (B, d_model) → .sum(dim=0) → (d_model,) ✓
-compute_spectral_routing (inference, prototype branch):
-  h_flat: (B, d_model)
-  h_norm: (B, 1)
-  proto p: (d_model,) → p.unsqueeze(-1): (d_model, 1)
-  matmul(h_flat, p.unsqueeze(-1)): (B, 1)    ← correct after .squeeze(-1) removed
-  / h_norm: (B, 1) / (B, 1) = (B, 1)        ← correct
-  fits: list of (B, 1) tensors
-  cat(fits, dim=1): (B, n_tasks) ✓
-  softmax → (B, n_tasks) → unsqueeze(2) → (B, n_tasks, 1) ✓
-```
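The broadcasting pitfall behind bug B2 is easy to reproduce. The sketch below uses NumPy as a stand-in for PyTorch (both follow the same broadcasting rules for these shapes); the variable names are illustrative and do not come from `t5_specroute.py`.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_model = 4, 512

h_flat = rng.standard_normal((batch, d_model))             # (B, d_model)
proto = rng.standard_normal(d_model)                       # (d_model,)
h_norm = np.linalg.norm(h_flat, axis=1, keepdims=True)     # (B, 1)

dot = h_flat @ proto[:, None]                              # (B, 1)

# Buggy version: squeezing to (B,) and dividing by (B, 1) broadcasts to (B, B).
buggy = dot.squeeze(-1) / h_norm

# Fixed version: keep the trailing dimension so the division stays elementwise.
fixed = dot / h_norm

if __name__ == "__main__":
    print("buggy shape:", buggy.shape)   # (4, 4)
    print("fixed shape:", fixed.shape)   # (4, 1)
```

No error is raised in the buggy case, which is why this kind of shape mistake tends to show up only as wrong routing weights.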
----
-## 2. Theoretical Soundness
-### 2.1 GPM-Routing Paradox — **Sound ✓**
-The paradox is well-formulated and mathematically valid:
-- GPM forces $\mathcal{S}_k \perp \mathcal{S}_j$ → same-domain tasks lose access to dominant input directions
-- Spectral affinity $\alpha_k(h) = \|V_k^T h\|^2 / \|h\|^2$ depends on LoRA subspace alignment
-- For $h \sim P(h|\text{imdb})$ with imdb's A in yelp's null-space: $\alpha_{\text{imdb}}(h) \ll \alpha_{\text{yelp}}(h)$
-**Critical validation**: ROOT uses the SAME GPM on LoRA-A and achieves AP=59.70. This confirms the issue is routing (spectral ≠ learned MLP), not orthogonality itself.
-### 2.2 Prototype Routing — **Conditionally Sound**
-**Strengths:**
-1. **Embedding space is GPM-immune** — prototypes live outside LoRA subspace ✓
-2. **Zero-replay, drift-free** — frozen embedding table, O(d) storage per task ✓
-3. **Task instruction prefix** provides strong discriminative signal (not discussed in theory but practically important)
-**Weaknesses / Assumptions:**
-1. **LDA optimality assumption**: Requires Gaussian mixture with shared covariance — real NLP data violates this. Prototype routing is a heuristic, not provably optimal.
-2. **Same-vocabulary tasks**: yelp vs amazon (both product reviews) may have very similar $\mu$. Cosine similarity may not discriminate well between these.
-3. **Multi-modal tasks**: Tasks like mnli have diverse input distributions (entailment/contradiction/neutral). The mean $\mu_{\text{mnli}}$ averages over modes → poor representative prototype.
-### 2.3 Training-Inference Gap
-**Concern**: Training uses A-row fit + bias (subspace signal), inference uses prototype cosine (vocabulary signal). These are fundamentally different routing mechanisms measuring different properties.
-- During training: B learns under 80% routing weight to current LoRA (forced by adaptive β)
-- During inference: prototype router selects expert based on vocabulary distribution
-- The expert's B was optimized under forced routing, not under prototype-based routing
-**Mitigation**: This gap also exists in ROOT (learned MLP vs frozen MLP at inference). It's inherent to soft-routing CL. The adaptive β ensures sufficient gradient flow regardless of routing mechanism.
----
-## 3. Methodology Assessment
-### 3.1 Alignment with Theory
-| Theory claim | Implementation | Aligned? |
-|---|---|---|
-| GPM-immune routing | Prototype from frozen embeddings | ✓ |
-| Same-domain discrimination | cosine(μ_k, h) captures vocabulary differences | ✓ (conditional on vocabulary divergence) |
-| Zero-replay | Running mean, O(d) per task | ✓ |
-| Drift-free | Frozen embedding table | ✓ |
-| Dual-mode routing | Train=A-row+β, Inference=prototype | ✓ |
-| C4 orthogonal to routing | Preconditioning + entropy unchanged | ✓ |
-### 3.2 Missing Design Decisions
-1. **Temperature calibration**: Cosine similarities are typically in [0.7, 0.99] range. With `attn_temperature=0.01`:
-   - cos_diff = 0.03 → score_diff = 3.0 → reasonable routing
-   - But if all tasks' cosines lie within ≈0.001 of each other (e.g. 0.950, 0.949, 0.948): softmax([95, 94.9, 94.8, ...]) ≈ near-uniform
-   - **No analysis of expected cosine separation** provided in SPECROUTE_IDEA.md
-2. **No hard routing option**: The idea doc mentions $k^* = \arg\max_k \cos(h, \mu_k)$ as an option but it's not implemented. Hard routing might outperform soft routing for clearly separable tasks.
-3. **No per-layer prototypes**: Prototypes are computed from input embeddings (layer 0). If later layers' representations are more discriminative, per-layer prototypes could improve routing.
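The temperature concern in item 1 can be made concrete with a few lines of plain Python. The cosine values below are illustrative assumptions, not measurements from the experiments; `softmax_route` is a hypothetical helper.

```python
import math

def softmax_route(cosines, tau=0.01):
    """Softmax over cosine similarities divided by temperature tau."""
    logits = [c / tau for c in cosines]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    # Well-separated prototypes: a gap of 0.03 in cosine becomes a gap of 3.0 in logit.
    print([round(w, 3) for w in softmax_route([0.95, 0.92, 0.90])])
    # Nearly colliding prototypes: gaps of 0.001 leave the routing close to uniform.
    print([round(w, 3) for w in softmax_route([0.950, 0.949, 0.948])])
```

Logging the actual cosine gaps after two or three tasks would show which of these two regimes the prototypes fall into before committing to a temperature.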
----
-## 4. Potential Failure Modes
-### 4.1 Prototype Collision
-**Scenario**: Tasks with very similar vocabulary distributions (yelp/amazon/imdb all sentiment).
-**Impact**: Routing errors → EM = 0.
-**Likelihood**: Medium for yelp↔amazon, Low for yelp↔imdb (movie vs restaurant vocabulary).
-**Mitigation**: Task instruction prefix in T5 provides additional discrimination.
-### 4.2 Temperature Sensitivity
-**Scenario**: `attn_temperature=0.01` is the same for spectral (training) and prototype (inference).
-**Impact**: Spectral fits ∈ [0, 0.5], cosine ∈ [0.7, 0.99]. The same temperature maps different score ranges.
-**Mitigation**: Prototype softmax only runs during inference (no gradient coupling). Wrong temperature only causes over-soft/over-hard routing, not gradient issues.
-### 4.3 Cold-Start Prototype Quality
-**Scenario**: During first few training steps, running mean is computed from very few samples.
-**Impact**: Early eval metrics may be noisy; prototype quality improves over training.
-**Likelihood**: Low risk — eval steps typically happen after enough batches.
-### 4.4 Spectral Entropy Bug (Pre-existing)
-**Issue**: `_compute_spectral_entropy_loss()` computes `svdvals(R_B' @ R_A'^T)` where the QR decomposition is applied to transposed matrices compared to `_thin_svd_low_rank()`.
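-**Mathematical analysis**: Current code computes `svdvals(R_{B^T} @ R_A^T)` ≠ `svdvals(R_B @ R_{A^T}^T)`. The latter gives the correct singular values of BA, because B = Q_B R_B and A = R_{A^T}^T Q_{A^T}^T, so BA = Q_B (R_B R_{A^T}^T) Q_{A^T}^T.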
-**Impact on V5**: The entropy regularizer still approximately encourages rank diversity, but the gradient direction is slightly wrong. This pre-dates V5 and affects V3/V4 equally.
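The QR identity behind this issue can be verified on random matrices. This is a sketch in NumPy rather than PyTorch, and `thin_singular_values` is a hypothetical helper name, not the project's `_thin_svd_low_rank()`; it only assumes B has shape (d_out, r) and A has shape (r, d_in).

```python
import numpy as np

def thin_singular_values(B, A):
    """Singular values of B @ A from two small QR factorizations.

    B has shape (d_out, r) and A has shape (r, d_in), with r much smaller
    than d_out and d_in.
    """
    _, R_B = np.linalg.qr(B)      # B   = Q_B R_B, R_B is (r, r)
    _, R_A = np.linalg.qr(A.T)    # A^T = Q_A R_A, hence A = R_A^T Q_A^T
    # B @ A = Q_B (R_B R_A^T) Q_A^T. Q_B and Q_A have orthonormal columns,
    # so the nonzero singular values of B @ A equal those of the (r, r) core.
    return np.linalg.svd(R_B @ R_A.T, compute_uv=False)

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 24, 4
B = rng.standard_normal((d_out, r))
A = rng.standard_normal((r, d_in))

reference = np.linalg.svd(B @ A, compute_uv=False)[:r]  # full-size SVD
thin = thin_singular_values(B, A)
max_abs_error = float(np.max(np.abs(reference - thin)))
print("max abs error:", max_abs_error)
```

Factoring `B.T` and `A` instead, as the buggy path is described to do, yields wide R factors whose product is not the core of BA, so its singular values generally differ.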
----
-## 5. Recommendations
-### Priority: High
-1. **Fix spectral entropy QR bug** (R1): Change `qr(B.T) → qr(B)` and `qr(A) → qr(A.T)` in `_compute_spectral_entropy_loss()`. This ensures mathematically exact singular values for regularization.
-### Priority: Medium
-2. **Add temperature analysis**: After training 2-3 tasks, log the cosine similarity distribution between test samples and all prototypes. Verify discrimination margin. Adjust temperature if needed.
-3. **Consider separate `inference_temperature`**: Allow prototype routing to use a different temperature than spectral routing during training.
-### Priority: Low
-4. **Prototype sanity check**: Before full 15-task run, do a 2-task pilot to verify cos(h, μ_correct) > cos(h, μ_wrong).
-5. **Document instruction prefix advantage**: The theory section doesn't mention that CL benchmark tasks have unique instruction prefixes, which strongly benefits prototype routing.
----
-## 6. Overall Assessment
-**Verdict**: V5 design is **theoretically well-motivated** and **implementation is correct** (after DEV Round 1 fixes). The GPM-Routing Paradox is a genuine insight that explains V3's AP=33.77 failure mode. Prototype routing is a reasonable solution that decouples routing from the LoRA subspace.
-**Main risk**: Prototype discrimination quality for same-vocabulary tasks. The approach may struggle with yelp↔amazon but should handle the critical failing tasks (imdb, sst2, wic, cb) well due to vocabulary divergence.
-**Expected outcome**: AP(EM) improvement from 33.77 → 40-50 range, driven primarily by recovering the 6 failing tasks. Whether it exceeds ROOT's 59.70 depends on C4's effectiveness + prototype routing quality.
-**Pre-existing issue worth fixing**: Spectral entropy QR bug (R1) — affects all versions, easy fix, mathematically correct.

results/review_2.md DELETED Viewed

@@ -1,184 +0,0 @@
-# Review Round 2 — V5 Deeper Analysis
-**Reviewer role**: Objective reviewer (deeper architectural / edge case / theoretical analysis)
-**Scope**: Building on review_1 findings; focusing on issues review_1 missed
----
-## 1. Prototype Discrimination Quality — **Key Risk**
-### 1.1 Common-Word Domination Problem
-The prototype $\mu_k = \frac{1}{N}\sum_i \bar{h}_i^{(k)}$ averages over ALL tokens including stop words (the, a, is, of, to). These high-frequency tokens have large aggregate weight in the mean, making prototypes for different tasks closer than expected.
-**Quantitative estimate**: For a sentence of 80 tokens, ~40 are stop words (50% typical in English). The stop word embeddings are shared across ALL tasks → they push all prototypes toward a common center. The task-discriminative signal comes from:
-1. **Content words** (~40 tokens): "movie", "restaurant", "hypothesis", etc.
-2. **Instruction prefix** (~15-20 tokens): unique per task in CL benchmarks.
-For tasks like yelp vs amazon (both review tasks), content words overlap significantly. Discrimination relies heavily on the instruction prefix.
-**Severity**: Medium. The instruction prefix provides a constant, distinctive signal in every sample. But for tasks with very similar instructions (mnli vs rte, both NLI), discrimination may be weak.
-### 1.2 Score Range and Temperature
-Cosine similarities between task prototypes are likely **very high** (>0.9) because:
-- All prototypes share common English embedding structure
-- T5 embeddings are not unit-normalized; the common component dominates
-With `attn_temperature = 0.01` and cosines in [0.92, 0.97]:
-- Score range: [92, 97], max gap ≈ 5
-- softmax([97, 95, 93, 92, ...]) → dominant task gets ≈86% of the weight over the four scores shown, and less if further prototypes score within 1-2 points of the top
-- This is reasonable soft routing
-But with cosines in [0.95, 0.96] (near-identical):
-- Score range: [95, 96], gap ≈ 1
-- softmax([96, 95.8, 95.5, ...]) → nearly uniform → routing fails
-**Recommendation**: After first few tasks, LOG the cosine similarity matrix between all prototypes. If max gap < 2 (in temperature-scaled space), consider:
-- Lower temperature (harder routing)
-- Mean-centering prototypes: $\tilde{\mu}_k = \mu_k - \bar{\mu}$ to remove the common component
-- Or TF-IDF weighting (but adds complexity)
----
-## 2. Edge Cases Verified ✓
-| Case | Handling | Correct? |
-|------|----------|----------|
-| First task (run_single=True) | Prototype accumulated, no routing, saved after training | ✓ |
-| Task 2 with 1 old prototype | _n_expected=2, _protos=[current_running_mean, old_proto]=2 → prototype routing active | ✓ |
-| Missing prototype files | _protos < _n_expected → spectral fallback | ✓ |
-| All zeros attention_mask | mask_count.clamp(min=1) → avg=0 → routing still works (cos(0, μ)=0 for all) | ✓ |
-| Gradient checkpointing | Prototype update at T5Stack.forward() top, not inside checkpointed blocks → runs once | ✓ |
-| Decoder | is_decoder=True → routing block skipped, uses encoder's weights | ✓ |
-| Mixed precision (fp16/bf16) | Prototype accumulated in float32 on CPU; cast at routing time | ✓ |
-| Memory | Single (d_model,) tensor per task; cleared after finalize | ✓ |
----
-## 3. Masked Mean Change — Impact Analysis
-V5 changed `avg_inputs_embeds` from `.mean()` to `.sum()/mask_count`:
-- V3: $h = \frac{1}{L}\sum_i m_i e_i$ (divided by total seq length including padding)
-- V5: $h = \frac{\sum_i m_i e_i}{\sum_i m_i}$ (divided by non-padding count)
-**Impact on routing**: All scoring functions (A-row fit, spectral fit, cosine sim) are scale-invariant (Rayleigh quotient / cosine). So the magnitude change doesn't affect routing scores.
-**Impact on direction**: The direction changes because padding tokens contribute 0 to numerator but inflate denominator in V3. V5 gives the true mean direction. This is more correct.
-**Impact on training routing vs V3**: The A-row fit during training uses the same `avg_inputs_embeds`. Since fit ∝ $\|A h\|^2 / \|h\|^2$, both the projection and normalization scale together. Net effect: negligible for same-padding batches, slight direction improvement for mixed-padding batches.
-**Conclusion**: The change is beneficial and backward-compatible. ✓
----
-## 4. Theoretical Gaps
-### 4.1 No Formal Guarantee for Prototype Routing
-The SPECROUTE_IDEA.md C2.1 section invokes LDA optimality under Gaussian mixture assumption. However:
-- **Real embeddings are NOT Gaussian**: Token embedding means have complex geometry
-- **Shared covariance assumption violated**: Different tasks have different variance structures (sentiment has high variance on affect dimensions; factual tasks have high variance on entity dimensions)
-- **Bayes risk bound inapplicable**: Without equal covariance, cosine similarity is not Bayes-optimal
-**However**: Prototype routing doesn't need to be Bayes-optimal — it just needs to outperform spectral routing (which has a provable failure mode for same-domain tasks). The bar is low given V3's AP=33.77.
-### 4.2 No Analysis of Prototype Drift Across Training
-The running mean is computed during task k's training, while training is modifying lora_B. One might worry that this changes the loss landscape and gradient sizes, and therefore the effective contribution of each sample to the prototype.
-It does not: the prototype uses frozen embeddings (inputs_embeds), which are independent of lora_B updates. The prototype is static w.r.t. model parameters. ✓
-### 4.3 Info Leakage from Eval During Training
-Bug fix B3 uses running mean as temporary prototype during training eval. The running mean is computed from TRAINING data → it reflects the training distribution. Eval data comes from a held-out split → slightly different distribution.
-**Impact**: Minimal. The training and eval distributions for the same task are very close. The running mean converges to the true task mean quickly (after ~50 batches).
----
-## 5. Architecture Consistency
-### 5.1 Training vs Inference Routing — Consistent?
-During training:
-- w_cur ≈ 0.8 (adaptive β), w_old ≈ 0.2/N (spectral routing)
-- ΔW·h = w_cur · B_cur · A_cur · h + Σ w_old_j · B_j · A_j · h
-During inference:
-- w_k = softmax(cos(h, μ_k)/τ) for ALL k
-- ΔW·h = Σ w_k · B_k · A_k · h
-The current task's contribution during training (80%) is much higher than during inference (maybe 30-60% depending on prototype discrimination). This means the model is trained with a routing distribution that's different from inference.
-**Is this a problem?** In ROOT, the same asymmetry exists (learned routing at training time vs frozen routing at inference). It's standard in CL. The key is that the current task gets sufficient gradient signal during training (80% weight ensures this).
-### 5.2 Order of Prototypes vs LoRA Weights
-Loading order: `previous_lora_list_sig.reverse()` → oldest first.
-- spectral_sigs[0] = oldest task
-- task_protos[0] = oldest task
-- previous_lora_weights loaded in same order
-In compute_spectral_routing, fits[0] = current task, fits[1:] = old tasks (oldest first).
-In prototype routing, _protos[0] = current, _protos[1:] = task_protos (oldest first).
-Both match. ✓
----
-## 6. Performance Predictions
-### 6.1 Tasks That Should Improve (routing fix)
-- **imdb, sst2** (EM was 0): Different instruction + vocabulary from yelp/amazon → prototypes should separate → EM > 0
-- **wic** (EM was 0): Word-in-context task, very different vocabulary from sentiment → should separate well
-- **cb, rte** (EM was 0): NLI tasks, different from sentiment → moderate prototype separation
-- **yelp, amazon** (EM ~36): Should maintain or improve with correct routing
-### 6.2 Tasks That Might NOT Improve
-- **mnli** (EM ~2): Multi-modal distribution (entailment/contradiction/neutral). Prototype averages over modes → weak signal. Also, mnli is trained late (task 4 in order 3) → still OK for prototype separation from earlier tasks.
-### 6.3 Overall AP Projection
-If imdb/sst2/wic go from 0 → ROOT-level (~50-70 range):
-- AP gain ≈ (50+60+55) / 15 = +11 pts (three recovered tasks averaging ~55 EM, averaged over 15 tasks)
-- If cb/rte also recover: additional +5-8 pts
-- Net AP: 33.77 + 11 + 6 ≈ **50-51 AP(EM)**
-- Compared to ROOT=59.70: still ~9 pts gap (from overall representation quality)
-C4 (preconditioning + entropy) could close another 3-5 pts if it improves single-task quality.
----
-## 7. New Recommendations
-### 7.1 (High) Consider Mean-Centering Prototypes
-After all prototypes are collected, subtract the global mean:
-$$\tilde{\mu}_k = \mu_k - \frac{1}{T}\sum_j \mu_j$$
-This removes the common English embedding component and highlights task-discriminative differences. Implement in `finalize_prototype()` or at routing time.
-**Caution**: This requires all prototypes to be available before routing, which conflicts with the streaming CL setup where we only have old prototypes + current. A compromise: subtract the running mean of old prototypes from both old and current prototypes.
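The effect of mean-centering can be illustrated on toy vectors. This is a minimal sketch, assuming prototypes are plain vectors that share a large common component; `mean_center` and `cosine` are illustrative names, and the three prototypes are made-up values rather than measured ones. In the streaming setting described above, the same function would be applied to whichever prototypes are available at the time.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0

def mean_center(prototypes):
    """Subtract the mean prototype from every prototype."""
    dim = len(prototypes[0])
    center = [sum(p[i] for p in prototypes) / len(prototypes) for i in range(dim)]
    return [[p[i] - center[i] for i in range(dim)] for p in prototypes]

# Toy prototypes: a large shared first coordinate plus small task-specific parts.
prototypes = [
    [10.0, 1.0, 0.0],
    [10.0, 0.0, 1.0],
    [10.0, -1.0, -1.0],
]
centered = mean_center(prototypes)

raw_cos = cosine(prototypes[0], prototypes[1])
centered_cos = cosine(centered[0], centered[1])
print(f"raw cosine      = {raw_cos:.3f}")
print(f"centered cosine = {centered_cos:.3f}")
```

Before centering the two prototypes look nearly identical; after centering only the task-specific coordinates remain, so the same temperature produces a much larger score gap.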
-### 7.2 (Medium) Log Cosine Similarity Matrix
-Add diagnostic logging during final evaluation: compute and print the pairwise cosine similarity matrix between all prototypes. This reveals:
-- Which tasks have similar prototypes (collision risk)
-- Whether temperature is appropriate (check score gaps)
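A diagnostic of this kind could look like the sketch below. It assumes prototypes are available as a name-to-vector mapping; `offdiag_cosine_stats` is a hypothetical helper and the task names and vectors are placeholders, not values from the experiment.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0

def offdiag_cosine_stats(prototypes):
    """Min, max and mean cosine over all distinct prototype pairs."""
    sims = [cosine(u, v) for u, v in combinations(prototypes.values(), 2)]
    return min(sims), max(sims), sum(sims) / len(sims)

# Placeholder prototypes keyed by task name.
prototypes = {
    "task_a": [1.0, 0.0],
    "task_b": [0.0, 1.0],
    "task_c": [1.0, 1.0],
}
lowest, highest, mean = offdiag_cosine_stats(prototypes)
print(f"off-diag min={lowest:.3f} max={highest:.3f} mean={mean:.3f}")
```

A high maximum flags a likely prototype collision, and dividing the smallest pairwise gap by the temperature shows whether the softmax can still separate those tasks.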
-### 7.3 (Low) Consider Prototype + Spectral Ensemble
-Instead of pure prototype routing, combine prototype cosine with spectral fit:
-$$s_k(h) = \alpha \cdot \cos(h, \mu_k) + (1-\alpha) \cdot \text{spectral\_fit}(h; V_k, \sigma_k)$$
-This handles both same-domain tasks (prototype helps) and orthogonal tasks (spectral helps). But adds a hyperparameter α — violates simplicity principle.
----
-## 8. Overall Assessment (Round 2)
-**Code quality**: Good after DEV Round 1+2 fixes. No remaining bugs found. All edge cases handled correctly.
-**Main risk**: Prototype discrimination quality — specifically for tasks with similar instruction prefixes. The T5 CL benchmark tasks generally have distinctive instructions, so this should work in practice.
-**Temperature**: `attn_temperature=0.01` maps cosine scores [0.9, 0.97] to [90, 97] → softmax gives reasonable routing. Acceptable without tuning, but logging recommended.
-**Overall confidence**: 75% that V5 significantly improves AP over V3 (33.77). 50% that it reaches ROOT (59.70). The gap would be closed by better single-task quality (C4) and potentially mean-centering (R7.1).

results/review_3.md DELETED Viewed

@@ -1,129 +0,0 @@
-# Review Round 3 — V5 Pipeline & Temperature
-**Reviewer role**: Objective reviewer (pipeline correctness, integration, experiment readiness)
-**Scope**: Full pipeline flow, temperature analysis, shell script verification
----
-## 1. Critical Finding: Temperature Mismatch — **FIXED**
-### Problem
-All V3/V4/V5 shell scripts omit `--attn_temperature`, using the default `routing_temperature = 1.0`. For spectral routing (V3), this produced near-uniform routing at inference (fits ∈ [0, 0.5], softmax barely discriminative). For prototype routing (V5), the situation would be even worse: cosine ∈ [0.85, 0.95], softmax with T=1.0 → near-uniform.
-**Impact if unfixed**: Prototype routing would provide almost zero discrimination. V5 would behave similarly to uniform averaging over all LoRAs → worse than V3.
-### Analysis
-| Temperature | Spectral fit (raw gap 0.3): gap/T → exp(gap/T) | Cosine sim (raw gap 0.05): gap/T → exp(gap/T) | Routing behavior |
-|-------------|-----------------|----------------|-------------------|
-| T=1.0 | 0.3 → 1.35 | 0.05 → 1.05 | Near-uniform |
-| T=0.1 | 3.0 → 20 | 0.5 → 1.65 | Mild discrimination |
-| T=0.01 | 30 → ≈1.1×10¹³ | 5.0 → 148 | Semi-hard routing |
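The ratios in this table follow from the fact that two experts whose scores differ by a gap g receive softmax weights in the ratio exp(g / T). A short sketch reproducing them, with the raw gaps 0.3 (spectral fit) and 0.05 (cosine) taken from the table and `weight_ratio` as an illustrative name:

```python
import math

def weight_ratio(score_gap, temperature):
    """Softmax weight ratio between two experts whose raw scores differ by score_gap."""
    return math.exp(score_gap / temperature)

for temperature in (1.0, 0.1, 0.01):
    spectral = weight_ratio(0.3, temperature)
    cosine = weight_ratio(0.05, temperature)
    print(f"T={temperature}: spectral ratio {spectral:.3g}, cosine ratio {cosine:.3g}")
```

Because cosine gaps are roughly six times smaller than spectral-fit gaps here, a temperature that works for one score type is far too soft or far too hard for the other.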
-### Fix Applied
-Added `self._prototype_temperature = 0.01` as an algorithmic constant (not a hyperparameter). Prototype routing uses this separate temperature, while training spectral routing continues using `routing_temperature` (1.0 default, with adaptive β compensating).
-This ensures:
-- **Training**: β formula works correctly with T=1.0, giving w_cur ≈ 80%. ✓
-- **Inference (prototype)**: T=0.01 gives semi-hard routing. For gap=0.05: ratio=148:1. ✓
-- **Inference (fallback spectral)**: Uses T=1.0, same as V3. Fair comparison. ✓
-- **Experiment isolation**: The ONLY difference between V3 and V5 at inference is the routing mechanism (prototype vs spectral), not the temperature.
----
-## 2. Pipeline Flow Verification
-### Full sequence for task k (k ≥ 2):
-```
-1. Load model + fresh LoRA
-2. Load old LoRA weights (reversed: oldest first)
-3. Load spectral signatures + task prototypes
-4. get_reg_matrix() → project A into null-space (InfLoRA)
-5. precompute_preconditioners() → (AA^T+εI)^{-1/2} for lora_B gradients
-6. Training loop:
-   a. forward() → compute avg_inputs_embeds (masked mean)
-   b. _update_prototype(avg_inputs_embeds) → accumulate running mean
-   c. compute_spectral_routing() → training branch (A-row + β)
-   d. training_step() → CE + entropy reg + preconditioning
-7. Save LoRA weights + spectral signatures
-8. finalize_prototype() → normalize μ_k, save task_prototype.pt
-9. get_representation() → collect GPM bases (may call forward → accumulates garbage prototype data, but harmless since already finalized+saved)
-10. is_inference = True
-11. prepare_inference_routing() → SVD of current task's LoRA (fallback only)
-12. predict() → forward in eval mode → prototype routing (separate T=0.01)
-```
-**Verified**: No race conditions, no data leakage, no gradient flow issues. ✓
-### Edge case: get_representation after finalize_prototype
-`get_representation()` does forward passes that call `_update_prototype()`, accumulating data after finalize already reset `_current_prototype_sum`. This is harmless:
-- The saved prototype (step 8) is the correct one
-- The garbage accumulation after finalize is never used (predict restores from finalize result)
-- `_current_task_prototype` (set by finalize) is NOT modified by `_update_prototype`
----
-## 3. Shell Script Verification
-| Item | V5 Script | Expected | Status |
-|------|-----------|----------|--------|
-| `--model_name specroute` | ✓ | specroute | ✓ |
-| `--threshold 0.995` | ✓ | ESA threshold | ✓ |
-| `--target_routing_alpha 0.8` | ✓ | Adaptive β target | ✓ |
-| `--lambda_entropy 0.01` | ✓ | C4 entropy weight | ✓ |
-| `--use_preconditioning True` | ✓ | C4 preconditioner | ✓ |
-| `--precond_eps 1e-6` | ✓ | Preconditioner regularization | ✓ |
-| `--entropy_warmup_ratio 0.1` | ✓ | Entropy warmup | ✓ |
-| `--attn_temperature` | NOT SET | Uses default 1.0 | ✓ (training uses 1.0, prototype uses hardcoded 0.01) |
-| lora_r=8, alpha=32 | ✓ | Same as ROOT | ✓ |
-| 15 tasks, order 3 | ✓ | Long benchmark | ✓ |
-| `--num_train_epochs 10` | ✓ | Same as ROOT | ✓ |
-No changes needed to the shell script.
----
-## 4. Code Cleanliness Post-Fixes
-### All bugs fixed across 3 rounds:
-| Round | Bug | File | Fix |
-|-------|-----|------|-----|
-| DEV 1 | Prototype fields not initialized for run_single | t5_specroute.py | Moved to `if not is_decoder:` block |
-| DEV 1 | `.squeeze(-1)` causes (B,B) shape in cosine sim | t5_specroute.py | Removed .squeeze(-1) |
-| DEV 1 | Training eval falls back to spectral (no prototype) | t5_specroute.py | Use running mean as temp prototype |
-| DEV 2 | Entropy QR uses wrong decomposition | cl_trainer_specroute.py | Changed qr(B.T)→qr(B), qr(A)→qr(A.T) |
-| DEV 3 | No prototype discrimination diagnostics | t5_specroute.py | Added pairwise cosine similarity logging |
-| DEV 3 | T=1.0 gives near-uniform prototype routing | t5_specroute.py | Added separate _prototype_temperature=0.01 |
-### Remaining non-blocking items:
-1. **Prototype accumulation during get_representation**: Benign, no fix needed
-2. **Mean-centering**: Deferred to V5.1 pending diagnostic data
-3. **Type annotation**: `attn_temperature` is `Optional[int]`, should be `Optional[float]` — works fine in practice (Python int→float coercion)
----
-## 5. Experiment Readiness Assessment
-**Ready to run?** ✅ YES
-**Checklist:**
-- [x] All 3 source files compile (t5_specroute.py, cl_trainer_specroute.py, run_t5.py)
-- [x] 6 bugs fixed (3 critical, 1 medium, 2 minor)
-- [x] Prototype routing with correct temperature
-- [x] Diagnostic logging for prototype quality
-- [x] Shell script configured correctly
-- [x] SPECROUTE_IDEA.md updated with C2.1 theory
-- [x] experiment_versions.md documented
-**Expected behavior on first run:**
-- Log: `[SpecRoute] Finalized task prototype (N samples, d=512)` after each task's training
-- Log: `[SpecRoute] Loaded K spectral signatures, K task prototypes` at start of each task ≥2
-- Log: `[SpecRoute] Prototype cosine matrix (n=K+1): off-diag min=..., max=..., mean=...` at prediction time
-- Routing weights during training: current task ≈80% (adaptive β with T=1.0)
-- Routing weights during inference: dominant weight on correct prototype (T=0.01)
-**What to monitor:**
-1. `prototype cosine matrix` log → check min gap between tasks. If max off-diag > 0.99: discrimination poor, need mean-centering
-2. eval_em during training of imdb/sst2/wic → should be > 0 (prototype active via running mean)
-3. Final AP(EM) → target > 45

results/review_4.md DELETED Viewed

@@ -1,130 +0,0 @@
-# Review Round 4 — Final Comprehensive Assessment
-**Reviewer role**: Final sign-off reviewer (code, theory, methodology, experiment readiness)
-**Scope**: Complete V5 — all files, all changes, all theory
----
-## 1. Final Code Audit
-### 1.1 Modified Files Summary
-| File | Lines changed | Changes | Status |
-|------|-------------|---------|--------|
-| `t5_specroute.py` | ~80 added/modified | Prototype fields, `_update_prototype()`, `finalize_prototype()`, dual-mode inference routing, masked mean fix, diagnostic logging, separate temperature | ✅ Compiles, logically correct |
-| `cl_trainer_specroute.py` | 3 lines | Entropy QR fix: `qr(B.T)→qr(B)`, `qr(A)→qr(A.T)`, removed shape check | ✅ Compiles, mathematically correct |
-| `run_t5.py` | ~15 added | Load task prototypes, finalize+save after training | ✅ Compiles, pipeline correct |
-| `SPECROUTE_IDEA.md` | ~60 added | C2.1 section (paradox + prototype routing + dual temperature) | ✅ Theory documented |
-| `experiment_versions.md` | ~80 added/modified | V5 section with full analysis and code changes | ✅ Experiment documented |
-### 1.2 Bug Resolution Tracking
-| # | Bug | Severity | Found | Fixed | Verified |
-|---|-----|----------|-------|-------|----------|
-| B1 | Prototype fields not init for run_single | Critical | Review 1 | DEV 1 | ✅ |
-| B2 | `.squeeze(-1)` causes (B,B) shape | Critical | Review 1 | DEV 1 | ✅ |
-| B3 | Training eval uses spectral fallback | Important | Review 1 | DEV 1 | ✅ |
-| B4 | Entropy QR wrong decomposition | Medium | Review 1 | DEV 2 | ✅ |
-| B5 | T=1.0 kills prototype discrimination | Critical | Review 3 | DEV 3 | ✅ |
-| B6 | Old task loop over-indented | Minor | Review 3 | DEV 3 | ✅ |
-All 6 bugs found and fixed. 3 were critical (would cause crash or complete misfunction).
----
-## 2. Theoretical Assessment
-### 2.1 GPM-Routing Paradox — **Valid ✓**
-- ROOT uses same GPM + achieves AP=59.70 → orthogonality is not the bottleneck
-- Spectral routing fails for same-domain tasks due to subspace orthogonality
-- Formally stated and correct
-### 2.2 Prototype Routing Solution — **Sound with caveats**
-- Decouples routing from LoRA subspace ✓
-- Zero-replay, drift-free, O(d) per task ✓
-- LDA justification is heuristic (not strict Gaussian mixture) but reasonable
-- **Residual risk**: same-vocabulary tasks (yelp/amazon) may have similar prototypes
-### 2.3 Dual-Temperature Design — **Well-motivated ✓**
-- Training: T=1.0 with adaptive β → w_cur ≈ 80%. Proven correct.
-- Inference prototype: T=0.01 → semi-hard routing. Matches prototype metric learning practices.
-- V3↔V5 comparison is fair: training routing is identical, only inference mechanism differs.
-### 2.4 C4 (Preconditioning + Entropy) — **Orthogonal, unaffected ✓**
-- Entropy QR bug fixed (computation now matches `_thin_svd_low_rank`)
-- Preconditioning unchanged
-- Both operate on LoRA weights, independent of routing mechanism
----
-## 3. Methodology Integrity
-### 3.1 Changes from V3 → V5
-| Aspect | V3 | V5 | Justified? |
-|--------|----|----|-----------|
-| Training routing | A-row + β (T=1.0) | Same | ✓ (no change) |
-| Inference routing | SVD spectral (T=1.0) | Prototype cosine (T=0.01) | ✓ (fixes paradox) |
-| avg_inputs_embeds | `.mean()` (includes padding) | `.sum()/mask_count` (correct) | ✓ (bug fix, scale-invariant) |
-| Entropy QR | Wrong `qr(B.T),qr(A)` | Correct `qr(B),qr(A.T)` | ✓ (bug fix) |
-| Prototype accumulation | N/A | Running mean in forward() | ✓ (zero overhead) |
-| Prototype storage | N/A | task_prototype.pt per task | ✓ (512 floats per task) |
-### 3.2 Confounding Variables
-- **Masked mean fix**: Could slightly change training routing direction, but scale-invariant scoring means impact is negligible
-- **Entropy QR fix**: Affects C4 regularization quality — this pre-existing bug was also present in V3. Fixing it makes V5 strictly better but slightly unfair vs V3. Acceptable since it's a bug fix.
-- **Temperature**: Training T=1.0 is unchanged. Inference T differs but this IS the routing mechanism change (not a confound).
----
-## 4. Risk Assessment
-| Risk | Severity | Likelihood | Mitigation |
-|------|----------|------------|------------|
-| Prototypes too similar for nearby tasks | High impact | Medium | Diagnostic logging added; mean-centering available as V5.1 |
-| T_proto=0.01 too aggressive | Medium | Low | Routing quality visible via routing weight logs |
-| First task prototype missing | High impact | Low | Prototype saved for all tasks including first (run_single) |
-| get_representation corrupts prototype | High | None | Verified: finalize+save before get_representation |
-| Mixed V3/V5 checkpoints | Medium | Low | Graceful fallback to spectral routing |
----
-## 5. Final Verdict
-### Code Quality: **A-**
-Clean implementation with proper separation of concerns. All edge cases handled. Diagnostic logging for production debugging. Minor deduction for: hardcoded `_prototype_temperature` (could be configurable) and benign prototype accumulation during `get_representation`.
-### Theoretical Rigor: **B+**
-GPM-Routing Paradox is genuine and well-formulated. Prototype routing is reasonable but not provably optimal (LDA assumption is approximate). The dual-temperature is well-justified. Missing: formal bound on prototype routing accuracy.
-### Experiment Design: **A**
-Fair comparison with V3 (same training routing). Controlled change (only inference mechanism). Proper documentation. Diagnostic tools for post-hoc analysis.
-### Ready for Execution: **YES ✅**
-**Expected AP(EM) range**: 40-55 (high confidence: 35-60)
-- Lower bound: prototypes similar for some tasks → partial recovery → +7 over V3
-- Upper bound: prototypes discriminative for all tasks + C4 effective → approaches ROOT
-### What Comes Next (if AP < 50):
-1. Check diagnostic logs: are prototypes separable?
-2. If max off-diag cosine > 0.95: implement mean-centering (V5.1)
-3. If routing looks correct but EM still low: investigate single-task quality (C4 tuning)
-4. If both routing and quality are fine: the gap to ROOT is from learned routing's adaptability
----
-## 6. Files Delivered
-| File | Type | Description |
-|------|------|-------------|
-| [t5_specroute.py](improve_gainlora/src/t5_specroute.py) | Code | V5 prototype routing + all fixes |
-| [cl_trainer_specroute.py](improve_gainlora/src/cl_trainer_specroute.py) | Code | Entropy QR fix |
-| [run_t5.py](improve_gainlora/src/run_t5.py) | Code | Prototype load/save |
-| [SPECROUTE_IDEA.md](improve_gainlora/SPECROUTE_IDEA.md) | Theory | C2.1 section + dual-temperature |
-| [experiment_versions.md](results/experiment_versions.md) | Tracking | V5 section complete |
-| [review_1.md](results/review_1.md) | Review | Code + theory review |
-| [review_2.md](results/review_2.md) | Review | Edge cases + prediction |
-| [review_3.md](results/review_3.md) | Review | Pipeline + temperature fix |
-| [review_4.md](results/review_4.md) | Review | Final assessment |
-| [gen_script_long_order3_t5_small_specroute_v5.sh](improve_gainlora/T5_small/gen_script_long_order3_t5_small_specroute_v5.sh) | Script | V5 experiment script |

results/specroute_v2_diagnosis.md DELETED Viewed

@@ -1,174 +0,0 @@
-# SpecRoute V2 → V3: Comprehensive Diagnosis & Remediation Plan
-## 1. V2 Results Overview
-| Metric | SpecRoute V2 | ROOT (GainLoRA-InfLoRA) | Gap |
-|--------|-------------|------------------------|-----|
-| AP(EM) | 30.73 | 59.70 | **-28.97** |
-| AP(rougeL) | 38.00 | 61.66 | **-23.66** |
-### Detailed Failure Breakdown
-**Group 1: EM = 0 throughout training (LABEL FORMAT MISMATCH)**
-| Task | Pos | Prediction | Ground Truth | Cause |
-|------|-----|-----------|-------------|-------------|
-| imdb | 8 | "positive"/"negative" | "Good"/"Bad" | Misrouted → yelp/amazon LoRA |
-| sst2 | 9 | "negative" | "Bad" | Misrouted → yelp/amazon LoRA |
-| wic | 15 | "the same meaning"/"different" | "True"/"False" | Misrouted → unknown LoRA |
-| cb | 4 | "bedroom"/"yes"/"virtuous"/gibberish | "entailment"/"neutral"/"contradiction" | Misrouted → unrelated LoRA |
-**Group 2: Catastrophic Forgetting**
-| Task | Best EM | Final EM | Cause |
-|------|---------|----------|-------------|
-| mnli | 31.25 (ep8) | 0.25 | Degenerate → always "neutral" |
-| rte | 51.26 (ep4) | 0.36 | Degenerate → always "entailment" |
-**Group 3: Working normally**
-| Task | SpecRoute | ROOT | Status |
-|------|-----------|------|--------|
-| copa | **47.00** | 42.00 | ✅ Better (+5.0) |
-| multirc | **55.42** | 50.52 | ✅ Better (+4.9) |
-| boolq | **61.44** | 60.43 | ✅ Better (+1.0) |
-| qqp | **77.03** | 76.96 | ✅ Tie |
----
-## 2. Root Cause Analysis
-### 2.1 PRIMARY BUG: Constant Training Bias Does Not Scale with the Number of Tasks
-**Current formula:**
-$$\text{fit}_{\text{cur}} = \frac{1}{L}\sum_\ell \frac{\sum_i (a_i \cdot h)^2}{r \|h\|^2} + \beta \quad (\beta = 1.0 \text{ fixed})$$
-**Softmax routing weight for the current task:**
-$$w_{\text{cur}} = \frac{e^{(\text{fit}_{\text{cur}})/T}}{e^{(\text{fit}_{\text{cur}})/T} + (n-1) \cdot e^{\text{fit}_{\text{old}}/T}}$$
-With $\beta = 1.0$, $T = 1.0$, fit_raw ≈ 0.12, fit_old ≈ 0.20:
-| n_tasks | Training $w_{\text{cur}}$ | Inference $w_{\text{cur}}$ |
-|---------|--------------------------|---------------------------|
-| 1 | 100% | 100% |
-| 2 | 71.5% | 48.0% |
-| 5 | 38.5% | 18.8% |
-| **8 (imdb)** | **26.4%** | **11.7%** |
-| 10 | 21.8% | 9.3% |
-| **15 (wic)** | **15.2%** | **6.2%** |
-**Consequences:**
-- Task 8 (imdb): only 26.4% routing weight during training → 73.6% of the gradient signal flows through old LoRAs → the model cannot learn the "Good"/"Bad" labels
-- Task 15 (wic): only 15.2% routing weight → it learns almost nothing
-**Comparison with ROOT:** ROOT uses an independent sigmoid per task: $w_k = |2\sigma(4\cos(x, \text{key}_k)) - 1|$. There is no zero-sum competition → each task can reach a weight of ~0.8 regardless of the number of tasks.
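The routing-weight table in section 2.1 can be reproduced directly from the softmax formula. This is a minimal sketch under the table's own simplification that all old tasks share one fit score; `current_task_weight` is an illustrative name, and the inputs (raw fit 0.12, old fit 0.20, β = 1.0, T = 1.0) are the values quoted above.

```python
import math

def current_task_weight(n_tasks, fit_cur, fit_old, beta=0.0, temperature=1.0):
    """Softmax weight of the current task against n_tasks - 1 old tasks
    that are all assumed to share the same fit score."""
    cur = math.exp((fit_cur + beta) / temperature)
    old = (n_tasks - 1) * math.exp(fit_old / temperature)
    return cur / (cur + old)

for n in (1, 2, 5, 8, 10, 15):
    train = current_task_weight(n, 0.12, 0.20, beta=1.0)  # bias applied in training
    infer = current_task_weight(n, 0.12, 0.20, beta=0.0)  # no bias at inference
    print(f"n={n:2d}  train={train:.1%}  inference={infer:.1%}")
```

The denominator grows linearly with the number of old tasks while the fixed bias contributes only a constant factor, which is why the current task's share keeps shrinking.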
-### 2.2 SECOND BUG: A-row Fit vs SVD Fit Asymmetry (Train-Test Gap)
-**Training:** $\text{fit}_{\text{cur}} = \text{A-row fit} + \beta$
-**Inference:** $\text{fit}_{\text{cur}} = \text{A-row fit}$ (no bias)
-The two formulas measure fit on **two different scales**:
-- **A-row fit** (current task): $\frac{\sum_i (a_i \cdot h)^2}{r \|h\|^2}$ (uniform weighting)
-- **SVD fit** (old tasks): $\frac{\sum_i \sigma_i^2 (v_i \cdot h)^2}{\sum_i \sigma_i^2 \|h\|^2}$ ($\sigma^2$-weighted)
-After null-space projection, the rows of A are constrained to a narrow subspace → A-row fit is **systematically lower** than SVD fit → old tasks always win routing at inference.
-### 2.3 THIRD BUG: Threshold Too Low (0.980 vs ROOT's 0.995)
-- threshold = 0.980 → each task occupies **more** of the null-space
-- After 7 tasks: the null-space left for task 8 is very narrow
-- The rows of A_8 are projected into a small null-space → extremely low A-row fit
-- ROOT uses 0.995 → each task occupies less of the null-space → capacity is preserved for later tasks
----
-## 3. Theoretical Analysis (Theory-Backed)
-### 3.1 Softmax Competition Bias (Information Theory)
-Softmax with a constant bias violates the **principle of maximum entropy** as the number of tasks grows. With $n$ tasks and similar fits, softmax converges to the uniform distribution $1/n$ (maximum entropy). A constant bias $\beta$ is not enough to counteract this when $n$ is large.
-**Adaptive bias derivation:** We want the current task to reach a fixed weight $\alpha$:
-$$\alpha = \frac{e^{(f + \beta)/T}}{e^{(f + \beta)/T} + (n-1)e^{f/T}}$$
-Solving for $\beta$:
-$$\boxed{\beta = T \cdot \ln\left(\frac{\alpha(n-1)}{1-\alpha}\right)}$$
-With $\alpha = 0.8$, $T = 1.0$:
-- n=2: $\beta$ = 1.39 → w = 80%
-- n=8: $\beta$ = 3.33 → w = 80%
-- n=15: $\beta$ = 4.03 → w = 80%
-**Paper connection**: This is similar to "bias correction" in the Adam optimizer: the bias must change over time to maintain the desired statistical property.
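The closed form for β can be checked by plugging it back into the softmax. This is a sketch under the derivation's own assumption that the current and old tasks share the same raw fit f; `adaptive_beta` and `current_task_weight` are illustrative names rather than functions from the codebase.

```python
import math

def adaptive_beta(n_tasks, alpha=0.8, temperature=1.0):
    """Bias that gives the current task softmax weight alpha against
    n_tasks - 1 old tasks with equal raw fit; 0 when there are no old tasks."""
    n_old = n_tasks - 1
    if n_old < 1:
        return 0.0
    return temperature * math.log(alpha * n_old / (1.0 - alpha))

def current_task_weight(n_tasks, fit, beta, temperature=1.0):
    """Softmax weight of the current task when every task has raw fit `fit`."""
    cur = math.exp((fit + beta) / temperature)
    old = (n_tasks - 1) * math.exp(fit / temperature)
    return cur / (cur + old)

for n in (2, 8, 15):
    beta = adaptive_beta(n)
    weight = current_task_weight(n, fit=0.2, beta=beta)
    print(f"n={n:2d}  beta={beta:.2f}  w_cur={weight:.1%}")
```

When the raw fits differ, as with fit_raw ≈ 0.12 versus fit_old ≈ 0.20, the achieved weight lands slightly below the target α, so the 80% figure should be read as approximate.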
-### 3.2 Rayleigh Quotient Symmetry (Linear Algebra)
-Fit formula hiện tại vi phạm **measurement symmetry**: current task và old tasks dùng metric khác nhau. Trong Grassmannian geometry, khoảng cách giữa hai subspace phải dùng cùng một metric.
-**Weighted Rayleigh quotient** (chuẩn cho cả hai):
-$$\text{fit}_k(h) = \frac{\sum_{i=1}^r \sigma_{k,i}^2 (v_{k,i} \cdot h)^2}{\sum_{i=1}^r \sigma_{k,i}^2 \cdot \|h\|^2}$$
-At inference, the current task must also use the SVD-based fit (the SVD is available because B ≠ 0 after training).
-**Connection to the literature**: Principal angle theory (Björck & Golub, 1973): the distance between subspaces must be measured via canonical angles, which is equivalent to a $\sigma$-weighted Rayleigh quotient.
-### 3.3 Null-Space Capacity Bound (GPM Theory)
-From GPM (Saha et al., 2021): with threshold $\tau$, each task occupies $\leq r(1-\tau)$ dimensions. The maximum number of tasks is therefore:
-$$n_{\max} = \left\lfloor \frac{d}{r(1-\tau)} \right\rfloor$$
-With d=512, r=8:
-- $\tau$ = 0.995 → capacity = 512/(8×0.005) = **12,800 tasks** (ample headroom)
-- $\tau$ = 0.980 → capacity = 512/(8×0.020) = **3,200 tasks** (still ample, but more aggressive)
-A threshold of 0.995 preserves more capacity for later tasks.
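The two capacity figures follow directly from the bound and can be reproduced in a few lines. This sketch uses exact rational arithmetic because a plain floating-point `512 / (8 * (1 - 0.995))` lands just below 12,800 and would floor to 12,799; the function name is hypothetical.

```python
import math
from fractions import Fraction


def null_space_capacity(d, r, tau):
    """n_max = floor(d / (r * (1 - tau))), computed exactly."""
    tau_exact = Fraction(str(tau))  # e.g. "0.995" -> 199/200
    return math.floor(d / (r * (1 - tau_exact)))


for tau in (0.995, 0.980):
    print(tau, null_space_capacity(512, 8, tau))
```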
----
-## 4. Fix Plan: SpecRoute V3
-### Fix 1: Adaptive Training Bias (Required)
-```
-β = T · ln(α · (n_old) / (1-α))    when n_old ≥ 1
-β = 0                                when n_old = 0
-```
-- `n_old = len(self.spectral_signatures)`: derived automatically from the number of loaded signatures
-- `α = target_routing_alpha`: config parameter, default 0.8
-- Ensures w_cur ≈ 80% regardless of the number of tasks
-### Fix 2: Symmetric Inference Routing (Required)
-- **Training**: Keep A-row fit + adaptive bias (cold-start compatible)
-- **Inference**: Compute SVD(B@A) for the current task → use the SVD fit for ALL tasks
-- Method: `prepare_inference_routing()`, called once before inference
-- Removes the A-row vs SVD asymmetry entirely
-### Fix 3: Threshold = 0.995 (Match ROOT)
-- Changed only in the shell script
-- Reduces null-space consumption per task
-- Preserves capacity for later tasks
-### Nothing else is added
-- No KL replay (it violates the zero-replay setting)
-- No learned routing parameters (they would remove the parameter-free novelty)
-- No changes to optimizer/lr/scheduler
-- Respects the principle: "only improve the implementation, do not over-engineer"
----
-## 5. Concrete Code Changes
-### File 1: `t5_specroute.py`
-1. Add a `prepare_inference_routing()` method to T5Stack
-2. Modify `compute_spectral_routing()`:
-   - Training: A-row fit + `adaptive_training_bias` (computed from α and n_old)
-   - Inference: SVD fit from `_current_task_svd` (precomputed)
-3. Add the property `adaptive_training_bias`
-### File 2: `run_t5.py`
-1. Add `target_routing_alpha` to prompt_config
-2. Call `model.encoder.prepare_inference_routing()` before inference
-### File 3: `gen_script_long_order3_t5_small_specroute_v3.sh`
-1. `--threshold 0.995`
-2. `--transthreshold 0.995`
-3. `--target_routing_alpha 0.8`
-4. Output dir: `specroute_v3`

results/v5_deep_analysis.md ADDED Viewed

	@@ -0,0 +1,164 @@

+# V5 Deep Analysis — SpecRoute Prototype Routing
+## 1. Overview
+V5 reaches AP(EM)=59.55, very close to ROOT=59.70 (Δ=-0.15). This is a large improvement over V2 (30.73) and V3 (27.66).
+**Main conclusion**: Prototype routing WORKS: it resolves the GPM-Routing Paradox and lifts performance from ~50% of ROOT to ~100% of ROOT.
+---
+## 2. Win/loss analysis against ROOT
+### V5 beats ROOT (7 tasks, avg +2.84):
+| Task | Type | Δ EM | Notes |
+|------|------|-----:|----------|
+| multirc | Reading comprehension | +9.92 | **Largest improvement**. Prototype routing picks the right LoRA for a complex task |
+| wic | Word sense disambiguation | +2.98 | V3 failed completely; V5 beats ROOT |
+| rte | NLI (small) | +2.17 | V3 failed; V5 beats ROOT |
+| copa | Causal reasoning | +2.00 | Slight improvement |
+| agnews | Topic classification | +1.37 | |
+| qqp | Paraphrase detection | +0.87 | |
+| boolq | Yes/No QA | +0.58 | |
+### V5 loses to ROOT (8 tasks, avg -2.77):
+| Task | Type | Δ EM | Notes |
+|------|------|-----:|----------|
+| yahoo | Topic classification | -7.62 | **Largest loss**. Single-task quality gap |
+| amazon | Sentiment (5-class) | -4.04 | Single-task quality |
+| sst2 | Sentiment (binary) | -4.01 | Single-task quality |
+| cb | NLI (tiny) | -3.57 | Both near-zero, CB inherently broken |
+| yelp | Sentiment (5-class) | -1.37 | Small gap |
+| imdb | Sentiment (binary) | -0.90 | Small gap |
+| dbpedia | Topic classification | -0.49 | Negligible |
+| mnli | NLI (large) | -0.15 | Negligible |
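The averages quoted in the two table headings, and the overall Δ of -0.15, can be recomputed from the per-task deltas. The snippet below simply copies the Δ EM column from both tables; the variable names are illustrative.

```python
# Per-task EM deltas (V5 minus ROOT), copied from the two tables above.
wins = {
    "multirc": 9.92, "wic": 2.98, "rte": 2.17, "copa": 2.00,
    "agnews": 1.37, "qqp": 0.87, "boolq": 0.58,
}
losses = {
    "yahoo": -7.62, "amazon": -4.04, "sst2": -4.01, "cb": -3.57,
    "yelp": -1.37, "imdb": -0.90, "dbpedia": -0.49, "mnli": -0.15,
}

avg_win = sum(wins.values()) / len(wins)
avg_loss = sum(losses.values()) / len(losses)
net_delta = (sum(wins.values()) + sum(losses.values())) / (len(wins) + len(losses))

print(f"avg win {avg_win:+.2f}, avg loss {avg_loss:+.2f}, net {net_delta:+.2f}")
```

The net value matches the reported gap between 59.55 and 59.70.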
+### Pattern Analysis
+**V5 wins on**: complex tasks (multirc, wic, rte, copa, boolq), all of which need accurate routing. Prototype routing in the embedding space discriminates better than MLP routing for tasks whose input structures differ.
+**V5 loses on**: sentiment tasks (amazon, sst2, yelp, imdb) and topic classification (yahoo). On these tasks V5's train_loss is still acceptable, but single-task quality is lower than ROOT's.
+**Root cause**: SpecRoute has NO KL distillation loss (it was dropped in V2). ROOT uses KL distillation to transfer knowledge between tasks, which benefits the sentiment tasks (same domain). SpecRoute uses strict orthogonality (GPM), so each task must "learn from scratch" in the null-space, which is weaker for same-domain tasks.
+---
+## 3. Forgetting Pattern
+Average forgetting = -0.85 (very low, better than expected for a 15-task sequence).
+**Highest forgetting**:
+- rte: -4.69 (trained 52.71 → final 48.01). rte is trained right after qqp (task 7 after task 6). multirc (task 13) and boolq (task 14) cause forgetting on rte.
+- yahoo: -2.30 (51.96 → 49.66). multirc and boolq cause mild forgetting on yahoo.
+- multirc: -1.46 (61.90 → 60.44). Only boolq and wic come after multirc.
+**Zero forgetting**: cb (0→0, never learned), copa (44→44), qqp (77.82→77.83), boolq (61.01→61.01), wic (58.46→58.46).
+**Observation**: GPM protection WORKS WELL. Forgetting is very low. The main problem is SINGLE-TASK QUALITY, not forgetting.
+---
+## 4. CB Failure — Deep Dive
+CB (CommitmentBank) = 0.00 EM throughout training.
+**Data**: 250 samples, 10 epochs, 8 steps/epoch = 80 total steps.
+Loss: 5.25 → 3.63 (down ~31%, yet eval_em stays at 0% throughout).
+**Why does it fail?**
+1. **Extreme low-resource**: 250 samples are too few for a 3-class NLI task
+2. **10 epochs are still not enough**: the loss has not converged (it is still decreasing at the final step)
+3. **ROOT also nearly fails**: EM=3.57 (only 2/56 test samples correct)
+4. **Task-specific difficulty**: CB answers are "entailment/contradiction/neutral", three complex labels that T5-small struggles to handle with 250 samples
+**Possible remedies** (that do NOT violate zero-replay):
+- Increase epochs for tiny datasets (e.g. epochs = max(10, 200/steps_per_epoch))
+- Use a lower weight decay for tiny datasets
+- **Should NOT be a priority**: CB also fails under ROOT, so this is a shared limitation, not one specific to SpecRoute
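The epoch rule from the first remedy can be written as a small helper. This is a sketch, not the trainer's actual code: `epochs_for` is a hypothetical name, and rounding the quotient up with `ceil` is an assumption (the rule in the text does not say how to round).

```python
import math


def epochs_for(steps_per_epoch, base_epochs=10, min_steps=200):
    """epochs = max(base_epochs, min_steps / steps_per_epoch), rounded up."""
    return max(base_epochs, math.ceil(min_steps / steps_per_epoch))


cb_epochs = epochs_for(8)       # CB runs 8 steps per epoch
large_epochs = epochs_for(300)  # a large dataset keeps the base schedule
print(cb_epochs, large_epochs)
```

For CB this raises the budget from 80 optimizer steps to 200, while datasets that already exceed 200 steps in 10 epochs are unaffected.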
+---
+## 5. Single-Task Quality Gap (Yahoo, SST2, Amazon)
+This is the most important problem to solve.
+### Yahoo (Δ = -7.62)
+- V5 trained=51.96, ROOT final=57.28
+- Train_loss=0.582 (moderate, not great)
+- Yahoo is task 12/15 with 10000 samples → enough data
+- **Hypothesis**: Preconditioning + entropy regularization may be interfering with learning for large-scale topic classification. Alternatively, the orthogonality constraint is too strict, so yahoo has to learn in a small subspace.
+### SST2 (Δ = -4.01)
+- V5 trained=81.42, ROOT=85.21
+- SST2 is task 9, after imdb (task 8, same sentiment domain)
+- GPM forces SST2's A ⊥ IMDB's A → SST2 must learn in a restricted null-space
+- ROOT's MLP routing lets SST2 share knowledge with IMDB → higher accuracy
+### Amazon (Δ = -4.04)
+- V5 trained=48.84, ROOT=52.05
+- Amazon (task 2) must be orthogonal to yelp (task 1), which shares the 5-class sentiment domain
+- As with SST2: strict orthogonality hurts same-domain tasks
+### Conclusion: the root cause is STRICT ORTHOGONALITY
+ROOT does not force LoRA directions to be orthogonal (GPM only protects; routing goes through an MLP).
+SpecRoute + InfLoRA forces A_k ⊥ A_j (GPM on LoRA) → same-domain tasks must learn in a restricted subspace → single-task quality drops.
+This is a fundamental trade-off: **strict orthogonality = low forgetting BUT lower single-task quality**.
+---
+## 6. Comparison with theoretical expectations
+The conversation summary recorded an expected V5 AP(EM) of 40-55. Actual = 59.55, which **EXCEEDS EXPECTATIONS**.
+Prototype routing solves exactly the problem that was analysed (the GPM-Routing Paradox). At this point it was unclear whether the relaxed-orthogonality plan from the V5 design (η=0.1) had been implemented; the outcome of that check is recorded under Option A in Section 7.
+---
+## 7. Proposals for V6 (following research_rule.txt: theory → weakness → solution)
+### Identified weakness
+**Single-task quality gap due to strict orthogonality**: V5 forgetting is only -0.85 (near zero), but V5 loses 2.77 EM on average across 8 tasks relative to ROOT.
+### Theoretical analysis
+Strict InfLoRA: A_new ∈ null(P_old) where P_old = Σ A_i A_i^T.
+→ Remaining null-space shrinks: dim(null) = d − k·r (for k tasks of rank r).
+→ Same-domain tasks (yelp↔amazon, imdb↔sst2) need similar directions but are forced into orthogonal subspaces.
+→ B must compensate harder → lower quality.
+ROOT avoids this: GPM protects the LoRA, but routing goes through a learned MLP → tasks CAN share LoRA capacity implicitly.
+### V6 directions (proposals; further analysis is needed before implementation)
+**Option A: Relaxed Orthogonality (NOT present in V5, verified)**
+- orthogonal_relaxation does NOT appear in the V5 script or in cl_trainer_specroute.py
+- V5 uses STRICT orthogonality (pure InfLoRA GPM)
+- Proposal: A_new ∈ (1−η)·null(P_old) + η·P_old, which allows a small overlap
+- η ∈ [0.05, 0.2]: keeps forgetting low while improving same-domain quality
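One possible reading of the Option A formula is a blend of the null-space component and the old-subspace component of each new row. The sketch below is an assumption-laden illustration, not the planned implementation: the function names are hypothetical, the old basis is assumed to be orthonormal, and the vectors are toy values. Setting η = 0 recovers the strict projection.

```python
def project_onto(basis, vec):
    """Project vec onto the span of the orthonormal row vectors in basis."""
    out = [0.0] * len(vec)
    for b in basis:
        coeff = sum(bi * vi for bi, vi in zip(b, vec))
        out = [o + coeff * bi for o, bi in zip(out, b)]
    return out


def relaxed_projection(old_basis, vec, eta):
    """Return (1 - eta) * null-space part + eta * old-subspace part of vec."""
    inside = project_onto(old_basis, vec)
    outside = [v - p for v, p in zip(vec, inside)]
    return [(1.0 - eta) * o + eta * p for o, p in zip(outside, inside)]


old_basis = [[1.0, 0.0, 0.0]]  # directions already claimed by old tasks
row = [2.0, 1.0, 0.0]          # candidate row for the new task's A
strict = relaxed_projection(old_basis, row, 0.0)
relaxed = relaxed_projection(old_basis, row, 0.1)
print(strict, relaxed)
```

The overlap with old directions grows linearly in η, which is the quantity the forgetting-versus-quality trade-off would have to be tuned against.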
+**Option B: Task-Aware Learning Rate**
+- Tiny tasks (CB, COPA) use a higher LR or more epochs
+- Adaptive schedule: epochs = max(10, min_steps / steps_per_epoch)
+- Simple, no theory change needed
+**Option C: Prototype Quality Enhancement**
+- Current prototype = running mean of frozen embeddings → may not discriminate well between similar tasks
+- Could weight the prototype by class-conditional means or use PCA of the embeddings
+- Need to check the cosine similarity matrix between prototypes (diagnostic log)
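The diagnostic in the last bullet could look like the following. This is a hedged sketch: the prototype vectors are made-up three-dimensional toy values rather than measured embeddings, the 0.95 threshold is an arbitrary illustration, and `similarity_matrix` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity_matrix(prototypes):
    """Nested dict: task name -> task name -> cosine similarity."""
    names = list(prototypes)
    return {a: {b: cosine(prototypes[a], prototypes[b]) for b in names} for a in names}


# Toy prototypes (illustrative numbers only).
toy = {
    "yelp": [1.0, 0.2, 0.0],
    "amazon": [0.9, 0.3, 0.0],
    "multirc": [0.0, 0.1, 1.0],
}
sims = similarity_matrix(toy)
# Flag task pairs whose prototypes are nearly parallel.
confusable = [(a, b) for a in toy for b in toy if a < b and sims[a][b] > 0.95]
print(confusable)
```

Pairs flagged this way are the ones where prototype routing is most likely to pick the wrong adapter.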
+### Priority
+1. ~~Verify whether orthogonal_relaxation is active in V5~~ → **VERIFIED: IT IS NOT** (strict orthogonality)
+2. **Option A**: Implement relaxed orthogonality, which has the greatest potential
+3. **Option B** for CB/COPA: simple fix, no theoretical risk
+4. **Option C** only if Option A is not enough
+---
+## 8. Conclusion
+V5 is an important milestone: prototype routing **confirms that the GPM-Routing Paradox analysis was correct** and **resolves 5/6 task failures**. An AP on par with ROOT (59.55 vs 59.70) is an excellent result for parameter-free routing (compared with ROOT's learned MLP routing).
+The current bottleneck has shifted from **routing quality** (solved) to **single-task learning quality** (caused by strict orthogonality). This is consistent with the C2 analysis (single-task quality) discussed earlier.
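+**If the single-task gap is closed** (-2.77 avg on the 8 losing tasks), V6 could surpass ROOT. Parity with ROOT on those 8 tasks alone is worth 22.15/15 ≈ 1.48 points, i.e. AP(EM) ≈ 61.0; reaching the ~62-63 target would require further gains beyond parity.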