natmin322 committed on
Commit 9ea634d · 1 Parent(s): dd291ef

SpecRoute V3: adaptive bias, symmetric inference, threshold 0.995, batch size optimization


V3 fixes (3 targeted):
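- Adaptive training bias: beta = T*ln(alpha*n/(1-alpha)), with T the softmax temperature, alpha the target routing weight (default 0.8) and n the number of old tasks; keeps the current task at ~80% routing weight during training
- Symmetric inference routing: prepare_inference_routing() computes the SVD for the current task, so every task is scored with the same formula at inference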
- Threshold 0.995: matches ROOT GPM capacity
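
As a quick numerical check of the adaptive bias, the sketch below evaluates beta = T*ln(alpha*n/(1-alpha)) and the softmax weight it produces. It assumes the simplified setting used in the SPECROUTE_IDEA.md derivation: all n old tasks share one fit score, and the current task's score is that value plus beta. The function names are illustrative and are not taken from the repository.

```python
import math


def adaptive_bias(n_old, alpha_target=0.8, tau=1.0):
    """beta(n) = tau * ln(alpha_target * n / (1 - alpha_target))."""
    if n_old < 1:
        raise ValueError("n_old must be at least 1")
    if not 0.0 < alpha_target < 1.0:
        raise ValueError("alpha_target must lie in (0, 1)")
    return tau * math.log(alpha_target * n_old / (1.0 - alpha_target))


def current_task_weight(beta, n_old, tau=1.0, old_fit=0.0):
    """Softmax weight of the current task against n_old old tasks.

    Assumption: every old task has the same fit score `old_fit`, and the
    current task scores `old_fit + beta`.
    """
    cur = math.exp((old_fit + beta) / tau)
    old = n_old * math.exp(old_fit / tau)
    return cur / (cur + old)


for n in (1, 5, 14):
    adaptive = current_task_weight(adaptive_bias(n), n)
    constant = current_task_weight(1.0, n)  # fixed bias of 1.0 for comparison
    print(f"n={n:2d}  adaptive w_t={adaptive:.3f}  constant-bias w_t={constant:.3f}")
```

Under these assumptions the adaptive weight stays at the target for every n, while the weight under a fixed bias shrinks as more old tasks compete in the softmax.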

Batch size optimization:
- T4_1gpu: BSZ 4->8, GA 8->4 (effective=32, ~2x fewer micro-batches)
- T4_2gpu: BSZ 2->4, GA 8->4 (effective=32)
- A100: BSZ 8->32, GA 4->1 (effective=32)
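
The three settings agree if the effective batch size is taken as per-device batch size times gradient accumulation steps times number of GPUs. That product rule and the GPU counts below are assumptions read off the mode names; the commit itself only states the totals. A minimal check:

```python
def effective_batch_size(per_device_bsz, grad_accum, num_gpus=1):
    """Samples contributing to one optimizer step (assumed product rule)."""
    return per_device_bsz * grad_accum * num_gpus


# (BSZ, GA, GPUs) before and after this commit; GPU counts are assumed.
settings = {
    "T4_1gpu": {"old": (4, 8, 1), "new": (8, 4, 1)},
    "T4_2gpu": {"old": (2, 8, 2), "new": (4, 4, 2)},
    "A100": {"old": (8, 4, 1), "new": (32, 1, 1)},
}

for mode, cfg in settings.items():
    old = effective_batch_size(*cfg["old"])
    new = effective_batch_size(*cfg["new"])
    print(f"{mode}: effective batch {old} -> {new}")
```

With these values every mode keeps an effective batch of 32 before and after the change, so only the number of micro-batches per optimizer step differs.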

Docs: V2 diagnosis report, V3 methodology in SPECROUTE_IDEA.md, experiment_versions.md

.gitignore CHANGED
@@ -14,4 +14,4 @@ logs/*
  logs
  *.log
  human_working_IdeaMethod_and_discuss/
-
+ running_H100.txt
improve_gainlora/SPECROUTE_IDEA.md CHANGED
@@ -185,17 +185,27 @@ $$B_t,\, A_t \;\xrightarrow[\text{QR + SVD}]{O(dr^2)}\; (V_t,\, \boldsymbol{\sig

  ### C2 — Spectral Affinity Routing

- **Inference** (all tasks available):
+ **Inference** (all tasks available, symmetric SVD-based routing):

  $$w(h) = \mathrm{softmax}\!\left(\frac{[\alpha_1(h),\; \ldots,\; \alpha_T(h)]}{\tau}\right)$$

+ All tasks (including the most recently trained) use the same $\sigma^2$-weighted spectral affinity formula (Definition 2). After training task $t$, we compute $\mathcal{S}_t$ once via `prepare_inference_routing()` and use it alongside old tasks' signatures.
+
  **Training** (task $t$, final SVD unknown because $B_t$ still training):

- $$\alpha_t^{\mathrm{train}}(h) = \frac{\|A_t\, h\|^2}{r\,\|h\|^2} + \beta$$
+ $$\alpha_t^{\mathrm{train}}(h) = \frac{\|A_t\, h\|^2}{r\,\|h\|^2} + \beta(n), \qquad \beta(n) = \tau \cdot \ln\!\left(\frac{\alpha_{\mathrm{target}} \cdot n}{1 - \alpha_{\mathrm{target}}}\right)$$
+
+ where $n = |\{\text{old tasks}\}|$ and $\alpha_{\mathrm{target}} \in (0,1)$ is the desired routing weight for the current task (default 0.8).

  **Justification of the A-row proxy:** For any full-rank $B_t$, the column span of $V_t$ (from SVD of $B_t A_t$) equals $\mathrm{range}(A_t^\top)$. So the A rows span the *same* input subspace that the converged $V_t$ will capture. The proxy measures input alignment with this subspace using uniform weighting (no $\sigma$ available yet).

- **Justification of $\beta$:** A rows (kaiming-initialised, unit-variance) produce systematically lower fits than $\sigma^2$-weighted SVD fits of trained old experts. Setting $\beta = 1.0$ makes the softmax produce $w_t > 0.95$, approximating the oracle assignment $w_t = 1$ (principled: during training on task $t$'s data, the optimal routing *is* $w_t = 1$) while allowing marginal knowledge transfer from relevant old experts.
+ **Justification of adaptive $\beta(n)$:** A constant bias $\beta_0$ causes the current task's softmax routing weight to decay as $O(1/n)$ with task count (softmax dilution). The adaptive formula normalises this: solving $w_t = \alpha_{\mathrm{target}}$ in the softmax equation yields the closed-form above. This ensures the current task receives routing weight $\approx \alpha_{\mathrm{target}}$ regardless of $n$, providing consistent gradient flow throughout the CL sequence.
+
+ *Derivation.* Let $f = \alpha_t^{\mathrm{train}}$, $g = \bar{\alpha}_{\mathrm{old}}$ (mean old task fit). The softmax weight for the current task among $n+1$ competitors:
+
+ $$w_t = \frac{e^{f/\tau}}{e^{f/\tau} + n\, e^{g/\tau}} = \alpha_{\mathrm{target}} \;\;\Longrightarrow\;\; \beta = f - g = \tau \cdot \ln\!\left(\frac{\alpha_{\mathrm{target}} \cdot n}{1 - \alpha_{\mathrm{target}}}\right)$$
+
+ **Justification of symmetric inference:** During training, the A-row proxy + adaptive bias is necessary because $B_t$ is evolving (cold-start). At inference, $B_t$ is frozen and $\Delta W_t = B_t A_t$ has well-defined SVD. Using the same $\sigma^2$-weighted Rayleigh quotient for all tasks ensures *measurement symmetry* — all affinities live on the same metric space, and the Routing–Protection Duality Theorem (Theorem 1) applies uniformly.

  ### C3 — Capacity-Aware Subspace Allocation

@@ -240,8 +250,9 @@ where $\varepsilon_0$ is the base threshold. This allocates incrementally strict
  | Theory | Implementation | File |
  |--------|---------------|------|
  | Spectral signature $\mathcal{S}_t$ | `compute_spectral_signatures()` (thin QR+SVD) | `t5_specroute.py` |
- | Spectral affinity $\alpha_t(h)$ (old tasks) | σ²-weighted Rayleigh quotient | `compute_spectral_routing()` |
- | A-row proxy $\alpha_t^{\mathrm{train}}$ (current) | `(proj**2).sum() / (r * h_norm_sq) + training_bias` | `compute_spectral_routing()` |
+ | Spectral affinity $\alpha_t(h)$ (all tasks at inference) | σ²-weighted Rayleigh quotient | `compute_spectral_routing()` |
+ | A-row proxy $\alpha_t^{\mathrm{train}}$ (current, training only) | A-row fit + adaptive bias $\beta(n)$ | `compute_spectral_routing()` |
+ | Symmetric inference SVD | `prepare_inference_routing()` → SVD of current $B_t A_t$ | `t5_specroute.py` |
  | Routing $w = \mathrm{softmax}(\alpha / \tau)$ | `torch.softmax(fit_scores / temp)` | `compute_spectral_routing()` |
  | Drift-free input $h$ | `inputs_embeds = self.embed_tokens(input_ids)` → mean-pool | `T5Stack.forward()` |
  | GPM + InfLoRA null-space | `get_reg_matrix()` | `cl_trainer_specroute.py` |
@@ -262,8 +273,8 @@ where $\varepsilon_0$ is the base threshold. This allocates incrementally strict
  ### Task $t \geq 2$
  1. Load model + fresh LoRA; load old LoRA weights and spectral signatures.
  2. InfLoRA: project current $A_t$ into null-space of old GPM bases.
- 3. Train `lora_B` with spectral affinity routing + training bias $\beta$.
- 4. Post-training: compute $\mathcal{S}_t$ + update GPM bases.
+ 3. Train `lora_B` with spectral affinity routing + adaptive bias $\beta(n)$.
+ 4. Post-training: compute $\mathcal{S}_t$ (`prepare_inference_routing` for inference, `compute_spectral_signatures` for storage) + update GPM bases.
  5. Save all artifacts for next task.

  ---
@@ -276,8 +287,8 @@ where $\varepsilon_0$ is the base threshold. This allocates incrementally strict
  | Benchmarks | SuperNI (15 tasks, 2 orderings), Long (15 tasks, 2 orderings) |
  | Metrics | AP (Average Performance, ↑), FT (Forgetting, ↓) |
  | LoRA | $r = 4$, $\alpha = 32$, dropout 0.0 |
- | Routing | $\tau = 1.0$, $\beta = 1.0$ (train only) |
- | ESA | $\varepsilon_0 = 0.980$ (dynamic) |
+ | Routing | $\tau = 1.0$, $\alpha_{\mathrm{target}} = 0.8$, adaptive $\beta(n)$ (train); symmetric SVD (inference) |
+ | ESA | $\varepsilon_0 = 0.995$ (dynamic) |
  | Precision | fp32 + gradient checkpointing |
  | Comparison | Batch size, LR, scheduler match ROOT (GainLoRA) exactly |
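
For reference, the training-time score from the C2 hunk of SPECROUTE_IDEA.md can be sketched in plain Python. It follows the formula $\|A_t h\|^2 / (r \|h\|^2) + \beta$ as written in the diff; the function name, the list-of-rows representation of $A_t$ and the example values are illustrative assumptions, not code from `t5_specroute.py`.

```python
import math


def a_row_proxy(A, h, beta):
    """Training-time affinity: ||A h||^2 / (r * ||h||^2) + beta.

    A is a list of r rows (each a list of floats), h is the pooled input
    vector, and beta is the routing bias added for the current task.
    """
    r = len(A)
    h_norm_sq = sum(x * x for x in h)
    if r == 0 or h_norm_sq == 0.0:
        raise ValueError("A must have at least one row and h must be non-zero")
    # Squared length of the projection of h onto the rows of A.
    proj_sq = sum(sum(a * x for a, x in zip(row, h)) ** 2 for row in A)
    return proj_sq / (r * h_norm_sq) + beta


# Toy example: r = 2 orthonormal rows, h lying entirely in their span.
A = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
h = [3.0, 4.0, 0.0]
beta = math.log(4.0)  # beta(n) for n = 1, alpha_target = 0.8, tau = 1

score = a_row_proxy(A, h, beta)
print(f"proxy score = {score:.4f}")
```

The ratio depends only on the direction of `h`, so rescaling the input leaves the score unchanged; the bias is the only term that moves with the task count.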
 
improve_gainlora/T5_small/gen_script_long_order3_t5_small_specroute_v3.sh ADDED
@@ -0,0 +1,893 @@
+ #!/bin/bash
+ #SBATCH -J cl
+ #SBATCH -o cl-%j.out
+ #SBATCH -p compute
+ #SBATCH -N 1
+ #SBATCH -t 20:00:00
+ #SBATCH --mem 128G
+ #SBATCH --gres=gpu:2
+
+ export CUDA_DEVICE_ORDER="PCI_BUS_ID"
+
+ port=$(shuf -i25000-30000 -n1)
+
+ # ============================================================
+ # Auto-detect GPU count and type for optimal parallelism
+ # ============================================================
+ NUM_GPUS=$(nvidia-smi -L 2>/dev/null | wc -l)
+ GPU_MEM=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits 2>/dev/null | head -1)
+
+ if [ -z "$GPU_MEM" ]; then
+     echo "ERROR: No GPU detected!"
+     exit 1
+ fi
+
+ # Determine GPU type
+ if [ "$GPU_MEM" -lt 20000 ]; then
+     IS_T4=1
+     echo "[GPU] Detected T4 GPUs (${GPU_MEM}MB VRAM each)"
+ else
+     IS_T4=0
+     echo "[GPU] Detected high-memory GPUs (${GPU_MEM}MB VRAM each)"
+ fi
+
+ # Determine parallelism strategy
+ if [ "$IS_T4" -eq 1 ] && [ "$NUM_GPUS" -ge 2 ]; then
+     GPU_MODE="t4_2gpu"
+     GPU_IDS="0,1"
+     FP16_FLAG=""
+     echo "[GPU] Strategy: 2x T4 DataParallel + fp32 + gradient_checkpointing"
+ elif [ "$IS_T4" -eq 1 ]; then
+     GPU_MODE="t4_1gpu"
+     GPU_IDS="${1:-0}"
+     FP16_FLAG=""
+     echo "[GPU] Strategy: 1x T4 + fp32 + gradient_checkpointing"
+ else
+     GPU_MODE="a100"
+     GPU_IDS="${1:-0}"
+     FP16_FLAG=""
+     echo "[GPU] Strategy: A100 (single GPU, fp32)"
+ fi
+
+ echo "[GPU] Using CUDA_VISIBLE_DEVICES=$GPU_IDS"
+ echo "============================================================"
+ echo ""
+
+ if [ "$GPU_MODE" = "t4_2gpu" ]; then
+     BSZ=4; GA=4; EVAL_BSZ=128
+ elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+     BSZ=8; GA=4; EVAL_BSZ=128
+ else
+     BSZ=32; GA=1; EVAL_BSZ=128
+ fi
+
+ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+     --do_train \
+     --do_predict \
+     --predict_with_generate \
+     --model_name_or_path $2 \
+     --data_dir CL_Benchmark \
+     --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+     --task_config_dir configs/gen_script_long_order3_t5_configs/yelp \
+     --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp \
+     --per_device_train_batch_size $BSZ \
+     --per_device_eval_batch_size $EVAL_BSZ \
+     --gradient_accumulation_steps $GA \
+     --learning_rate 0.0003 \
+     --num_train_epochs 10 \
+     --run_name gen_script_long_order3_t5_small_specroute_v3 \
+     --max_source_length 512 \
+     --max_target_length 50 \
+     --generation_max_length 50 \
+     --add_task_name False \
+     --add_dataset_name False \
+     --overwrite_output_dir \
+     --overwrite_cache \
+     --lr_scheduler_type constant \
+     --warmup_steps 0 \
+     --logging_strategy steps \
+     --logging_steps 10 \
+     --metric_for_best_model eval_exact_match \
+     --evaluation_strategy epoch \
+     --save_strategy epoch \
+     --save_total_limit 1 \
+     --load_best_model_at_end \
+     --lora_r 8 \
+     --lora_alpha 32 \
+     --lora_dropout 0.0 \
+     --data_replay_freq -1 \
+     --mlp_hidden_dim 100 \
+     --model_name specroute \
+     --target_routing_alpha 0.8 \
+     --gen_data_dir CL_Benchmark \
+     --threshold 0.995 \
+     --transthreshold 0.995 \
+     $FP16_FLAG
+
+ rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/checkpoint*
+
+ sleep 5
+
+ if [ "$GPU_MODE" = "t4_2gpu" ]; then
+     BSZ=4; GA=4; EVAL_BSZ=128
+ elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+     BSZ=8; GA=4; EVAL_BSZ=128
+ else
+     BSZ=32; GA=1; EVAL_BSZ=128
+ fi
+
+ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+     --do_train \
+     --do_predict \
+     --predict_with_generate \
+     --model_name_or_path $2 \
+     --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights \
+     --data_dir CL_Benchmark \
+     --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+     --task_config_dir configs/gen_script_long_order3_t5_configs/amazon \
+     --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon \
+     --per_device_train_batch_size $BSZ \
+     --per_device_eval_batch_size $EVAL_BSZ \
+     --gradient_accumulation_steps $GA \
+     --learning_rate 0.0003 \
+     --num_train_epochs 10 \
+     --run_name gen_script_long_order3_t5_small_specroute_v3 \
+     --max_source_length 512 \
+     --max_target_length 50 \
+     --generation_max_length 50 \
+     --add_task_name False \
+     --add_dataset_name False \
+     --overwrite_output_dir \
+     --overwrite_cache \
+     --lr_scheduler_type constant \
+     --warmup_steps 0 \
+     --logging_strategy steps \
+     --logging_steps 10 \
+     --metric_for_best_model eval_exact_match_for_amazon \
+     --evaluation_strategy epoch \
+     --save_strategy epoch \
+     --save_total_limit 1 \
+     --load_best_model_at_end \
+     --lora_r 8 \
+     --lora_alpha 32 \
+     --lora_dropout 0.0 \
+     --data_replay_freq -1 \
+     --mlp_hidden_dim 100 \
+     --model_name specroute \
+     --target_routing_alpha 0.8 \
+     --gen_data_dir CL_Benchmark \
+     --threshold 0.995 \
+     --transthreshold 0.995 \
+     $FP16_FLAG
+
+ rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/checkpoint*
+
+ sleep 5
+
+ if [ "$GPU_MODE" = "t4_2gpu" ]; then
+     BSZ=4; GA=4; EVAL_BSZ=128
+ elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+     BSZ=8; GA=4; EVAL_BSZ=128
+ else
+     BSZ=32; GA=1; EVAL_BSZ=128
+ fi
+
+ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+     --do_train \
+     --do_predict \
+     --predict_with_generate \
+     --model_name_or_path $2 \
+     --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights \
+     --data_dir CL_Benchmark \
+     --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+     --task_config_dir configs/gen_script_long_order3_t5_configs/mnli \
+     --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli \
+     --per_device_train_batch_size $BSZ \
+     --per_device_eval_batch_size $EVAL_BSZ \
+     --gradient_accumulation_steps $GA \
+     --learning_rate 0.0003 \
+     --num_train_epochs 10 \
+     --run_name gen_script_long_order3_t5_small_specroute_v3 \
+     --max_source_length 512 \
+     --max_target_length 50 \
+     --generation_max_length 50 \
+     --add_task_name False \
+     --add_dataset_name False \
+     --overwrite_output_dir \
+     --overwrite_cache \
+     --lr_scheduler_type constant \
+     --warmup_steps 0 \
+     --logging_strategy steps \
+     --logging_steps 10 \
+     --metric_for_best_model eval_exact_match_for_mnli \
+     --evaluation_strategy epoch \
+     --save_strategy epoch \
+     --save_total_limit 1 \
+     --load_best_model_at_end \
+     --lora_r 8 \
+     --lora_alpha 32 \
+     --lora_dropout 0.0 \
+     --data_replay_freq -1 \
+     --mlp_hidden_dim 100 \
+     --model_name specroute \
+     --target_routing_alpha 0.8 \
+     --gen_data_dir CL_Benchmark \
+     --threshold 0.995 \
+     --transthreshold 0.995 \
+     $FP16_FLAG
+
+ rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/checkpoint*
+
+ sleep 5
+
+ if [ "$GPU_MODE" = "t4_2gpu" ]; then
+     BSZ=4; GA=4; EVAL_BSZ=128
+ elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+     BSZ=8; GA=4; EVAL_BSZ=128
+ else
+     BSZ=32; GA=1; EVAL_BSZ=128
+ fi
+
+ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+     --do_train \
+     --do_predict \
+     --predict_with_generate \
+     --model_name_or_path $2 \
+     --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights \
+     --data_dir CL_Benchmark \
+     --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+     --task_config_dir configs/gen_script_long_order3_t5_configs/cb \
+     --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb \
+     --per_device_train_batch_size $BSZ \
+     --per_device_eval_batch_size $EVAL_BSZ \
+     --gradient_accumulation_steps $GA \
+     --learning_rate 0.0003 \
+     --num_train_epochs 10 \
+     --run_name gen_script_long_order3_t5_small_specroute_v3 \
+     --max_source_length 512 \
+     --max_target_length 50 \
+     --generation_max_length 50 \
+     --add_task_name False \
+     --add_dataset_name False \
+     --overwrite_output_dir \
+     --overwrite_cache \
+     --lr_scheduler_type constant \
+     --warmup_steps 0 \
+     --logging_strategy steps \
+     --logging_steps 10 \
+     --metric_for_best_model eval_exact_match_for_cb \
+     --evaluation_strategy epoch \
+     --save_strategy epoch \
+     --save_total_limit 1 \
+     --load_best_model_at_end \
+     --lora_r 8 \
+     --lora_alpha 32 \
+     --lora_dropout 0.0 \
+     --data_replay_freq -1 \
+     --mlp_hidden_dim 100 \
+     --model_name specroute \
+     --target_routing_alpha 0.8 \
+     --gen_data_dir CL_Benchmark \
+     --threshold 0.995 \
+     --transthreshold 0.995 \
+     $FP16_FLAG
+
+ rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/checkpoint*
+
+ sleep 5
+
+ if [ "$GPU_MODE" = "t4_2gpu" ]; then
+     BSZ=4; GA=4; EVAL_BSZ=128
+ elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+     BSZ=8; GA=4; EVAL_BSZ=128
+ else
+     BSZ=32; GA=1; EVAL_BSZ=128
+ fi
+
+ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+     --do_train \
+     --do_predict \
+     --predict_with_generate \
+     --model_name_or_path $2 \
+     --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights \
+     --data_dir CL_Benchmark \
+     --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+     --task_config_dir configs/gen_script_long_order3_t5_configs/copa \
+     --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa \
+     --per_device_train_batch_size $BSZ \
+     --per_device_eval_batch_size $EVAL_BSZ \
+     --gradient_accumulation_steps $GA \
+     --learning_rate 0.0003 \
+     --num_train_epochs 10 \
+     --run_name gen_script_long_order3_t5_small_specroute_v3 \
+     --max_source_length 512 \
+     --max_target_length 50 \
+     --generation_max_length 50 \
+     --add_task_name False \
+     --add_dataset_name False \
+     --overwrite_output_dir \
+     --overwrite_cache \
+     --lr_scheduler_type constant \
+     --warmup_steps 0 \
+     --logging_strategy steps \
+     --logging_steps 10 \
+     --metric_for_best_model eval_exact_match_for_copa \
+     --evaluation_strategy epoch \
+     --save_strategy epoch \
+     --save_total_limit 1 \
+     --load_best_model_at_end \
+     --lora_r 8 \
+     --lora_alpha 32 \
+     --lora_dropout 0.0 \
+     --data_replay_freq -1 \
+     --mlp_hidden_dim 100 \
+     --model_name specroute \
+     --target_routing_alpha 0.8 \
+     --gen_data_dir CL_Benchmark \
+     --threshold 0.995 \
+     --transthreshold 0.995 \
+     $FP16_FLAG
+
+ rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/checkpoint*
+
+ sleep 5
+
+ if [ "$GPU_MODE" = "t4_2gpu" ]; then
+     BSZ=4; GA=4; EVAL_BSZ=128
+ elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+     BSZ=8; GA=4; EVAL_BSZ=128
+ else
+     BSZ=32; GA=1; EVAL_BSZ=128
+ fi
+
+ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+     --do_train \
+     --do_predict \
+     --predict_with_generate \
+     --model_name_or_path $2 \
+     --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights \
+     --data_dir CL_Benchmark \
+     --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+     --task_config_dir configs/gen_script_long_order3_t5_configs/qqp \
+     --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp \
+     --per_device_train_batch_size $BSZ \
+     --per_device_eval_batch_size $EVAL_BSZ \
+     --gradient_accumulation_steps $GA \
+     --learning_rate 0.0003 \
+     --num_train_epochs 10 \
+     --run_name gen_script_long_order3_t5_small_specroute_v3 \
+     --max_source_length 512 \
+     --max_target_length 50 \
+     --generation_max_length 50 \
+     --add_task_name False \
+     --add_dataset_name False \
+     --overwrite_output_dir \
+     --overwrite_cache \
+     --lr_scheduler_type constant \
+     --warmup_steps 0 \
+     --logging_strategy steps \
+     --logging_steps 10 \
+     --metric_for_best_model eval_exact_match_for_qqp \
+     --evaluation_strategy epoch \
+     --save_strategy epoch \
+     --save_total_limit 1 \
+     --load_best_model_at_end \
+     --lora_r 8 \
+     --lora_alpha 32 \
+     --lora_dropout 0.0 \
+     --data_replay_freq -1 \
+     --mlp_hidden_dim 100 \
+     --model_name specroute \
+     --target_routing_alpha 0.8 \
+     --gen_data_dir CL_Benchmark \
+     --threshold 0.995 \
+     --transthreshold 0.995 \
+     $FP16_FLAG
+
+ rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/checkpoint*
+
+ sleep 5
+
+ if [ "$GPU_MODE" = "t4_2gpu" ]; then
+     BSZ=4; GA=4; EVAL_BSZ=128
+ elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+     BSZ=8; GA=4; EVAL_BSZ=128
+ else
+     BSZ=32; GA=1; EVAL_BSZ=128
+ fi
+
+ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+     --do_train \
+     --do_predict \
+     --predict_with_generate \
+     --model_name_or_path $2 \
+     --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/saved_weights \
+     --data_dir CL_Benchmark \
+     --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+     --task_config_dir configs/gen_script_long_order3_t5_configs/rte \
+     --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte \
+     --per_device_train_batch_size $BSZ \
+     --per_device_eval_batch_size $EVAL_BSZ \
+     --gradient_accumulation_steps $GA \
+     --learning_rate 0.0003 \
+     --num_train_epochs 10 \
+     --run_name gen_script_long_order3_t5_small_specroute_v3 \
+     --max_source_length 512 \
+     --max_target_length 50 \
+     --generation_max_length 50 \
+     --add_task_name False \
+     --add_dataset_name False \
+     --overwrite_output_dir \
+     --overwrite_cache \
+     --lr_scheduler_type constant \
+     --warmup_steps 0 \
+     --logging_strategy steps \
+     --logging_steps 10 \
+     --metric_for_best_model eval_exact_match_for_rte \
+     --evaluation_strategy epoch \
+     --save_strategy epoch \
+     --save_total_limit 1 \
+     --load_best_model_at_end \
+     --lora_r 8 \
+     --lora_alpha 32 \
+     --lora_dropout 0.0 \
+     --data_replay_freq -1 \
+     --mlp_hidden_dim 100 \
+     --model_name specroute \
+     --target_routing_alpha 0.8 \
+     --gen_data_dir CL_Benchmark \
+     --threshold 0.995 \
+     --transthreshold 0.995 \
+     $FP16_FLAG
+
+ rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte/checkpoint*
+
+ sleep 5
+
+ if [ "$GPU_MODE" = "t4_2gpu" ]; then
+     BSZ=4; GA=4; EVAL_BSZ=128
+ elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+     BSZ=8; GA=4; EVAL_BSZ=128
+ else
+     BSZ=32; GA=1; EVAL_BSZ=128
+ fi
+
+ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+     --do_train \
+     --do_predict \
+     --predict_with_generate \
+     --model_name_or_path $2 \
+     --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte/saved_weights \
+     --data_dir CL_Benchmark \
+     --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+     --task_config_dir configs/gen_script_long_order3_t5_configs/imdb \
+     --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/8-imdb \
+     --per_device_train_batch_size $BSZ \
+     --per_device_eval_batch_size $EVAL_BSZ \
+     --gradient_accumulation_steps $GA \
+     --learning_rate 0.0003 \
+     --num_train_epochs 10 \
+     --run_name gen_script_long_order3_t5_small_specroute_v3 \
+     --max_source_length 512 \
+     --max_target_length 50 \
+     --generation_max_length 50 \
+     --add_task_name False \
+     --add_dataset_name False \
+     --overwrite_output_dir \
+     --overwrite_cache \
+     --lr_scheduler_type constant \
+     --warmup_steps 0 \
+     --logging_strategy steps \
+     --logging_steps 10 \
+     --metric_for_best_model eval_exact_match_for_imdb \
+     --evaluation_strategy epoch \
+     --save_strategy epoch \
+     --save_total_limit 1 \
+     --load_best_model_at_end \
+     --lora_r 8 \
+     --lora_alpha 32 \
+     --lora_dropout 0.0 \
+     --data_replay_freq -1 \
+     --mlp_hidden_dim 100 \
+     --model_name specroute \
+     --target_routing_alpha 0.8 \
+     --gen_data_dir CL_Benchmark \
+     --threshold 0.995 \
+     --transthreshold 0.995 \
+     $FP16_FLAG
+
+ rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/8-imdb/checkpoint*
+
+ sleep 5
+
+ if [ "$GPU_MODE" = "t4_2gpu" ]; then
+     BSZ=4; GA=4; EVAL_BSZ=128
+ elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+     BSZ=8; GA=4; EVAL_BSZ=128
+ else
+     BSZ=32; GA=1; EVAL_BSZ=128
+ fi
+
+ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+     --do_train \
+     --do_predict \
+     --predict_with_generate \
+     --model_name_or_path $2 \
+     --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/8-imdb/saved_weights \
+     --data_dir CL_Benchmark \
+     --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+     --task_config_dir configs/gen_script_long_order3_t5_configs/sst2 \
+     --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/9-sst2 \
+     --per_device_train_batch_size $BSZ \
+     --per_device_eval_batch_size $EVAL_BSZ \
+     --gradient_accumulation_steps $GA \
+     --learning_rate 0.0003 \
+     --num_train_epochs 10 \
+     --run_name gen_script_long_order3_t5_small_specroute_v3 \
+     --max_source_length 512 \
+     --max_target_length 50 \
+     --generation_max_length 50 \
+     --add_task_name False \
+     --add_dataset_name False \
+     --overwrite_output_dir \
+     --overwrite_cache \
+     --lr_scheduler_type constant \
+     --warmup_steps 0 \
+     --logging_strategy steps \
+     --logging_steps 10 \
+     --metric_for_best_model eval_exact_match_for_sst2 \
+     --evaluation_strategy epoch \
+     --save_strategy epoch \
+     --save_total_limit 1 \
+     --load_best_model_at_end \
+     --lora_r 8 \
+     --lora_alpha 32 \
+     --lora_dropout 0.0 \
+     --data_replay_freq -1 \
+     --mlp_hidden_dim 100 \
+     --model_name specroute \
+     --target_routing_alpha 0.8 \
+     --gen_data_dir CL_Benchmark \
+     --threshold 0.995 \
+     --transthreshold 0.995 \
+     $FP16_FLAG
+
+ rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/9-sst2/checkpoint*
+
+ sleep 5
+
559
+ if [ "$GPU_MODE" = "t4_2gpu" ]; then
560
+ BSZ=4; GA=4; EVAL_BSZ=128
561
+ elif [ "$GPU_MODE" = "t4_1gpu" ]; then
562
+ BSZ=8; GA=4; EVAL_BSZ=128
563
+ else
564
+ BSZ=32; GA=1; EVAL_BSZ=128
565
+ fi
566
+
567
+ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
568
+ --do_train \
569
+ --do_predict \
570
+ --predict_with_generate \
571
+ --model_name_or_path $2 \
572
+ --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/8-imdb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/9-sst2/saved_weights \
573
+ --data_dir CL_Benchmark \
574
+ --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
575
+ --task_config_dir configs/gen_script_long_order3_t5_configs/dbpedia \
576
+ --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/10-dbpedia \
577
+ --per_device_train_batch_size $BSZ \
578
+ --per_device_eval_batch_size $EVAL_BSZ \
579
+ --gradient_accumulation_steps $GA \
580
+ --learning_rate 0.0003 \
581
+ --num_train_epochs 10 \
582
+ --run_name gen_script_long_order3_t5_small_specroute_v3 \
583
+ --max_source_length 512 \
584
+ --max_target_length 50 \
585
+ --generation_max_length 50 \
586
+ --add_task_name False \
587
+ --add_dataset_name False \
588
+ --overwrite_output_dir \
589
+ --overwrite_cache \
590
+ --lr_scheduler_type constant \
591
+ --warmup_steps 0 \
592
+ --logging_strategy steps \
593
+ --logging_steps 10 \
594
+ --metric_for_best_model eval_exact_match_for_dbpedia \
595
+ --evaluation_strategy epoch \
596
+ --save_strategy epoch \
597
+ --save_total_limit 1 \
598
+ --load_best_model_at_end \
599
+ --lora_r 8 \
600
+ --lora_alpha 32 \
601
+ --lora_dropout 0.0 \
602
+ --data_replay_freq -1 \
603
+ --mlp_hidden_dim 100 \
604
+ --model_name specroute \
605
+ --target_routing_alpha 0.8 \
606
+ --gen_data_dir CL_Benchmark \
607
+ --threshold 0.995 \
608
+ --transthreshold 0.995 \
609
+ $FP16_FLAG
610
+
611
+ rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/10-dbpedia/checkpoint*
612
+
613
+ sleep 5
614
+
615
+ if [ "$GPU_MODE" = "t4_2gpu" ]; then
616
+ BSZ=4; GA=4; EVAL_BSZ=128
617
+ elif [ "$GPU_MODE" = "t4_1gpu" ]; then
618
+ BSZ=8; GA=4; EVAL_BSZ=128
619
+ else
620
+ BSZ=32; GA=1; EVAL_BSZ=128
621
+ fi
622
+
623
+ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
624
+ --do_train \
625
+ --do_predict \
626
+ --predict_with_generate \
627
+ --model_name_or_path $2 \
628
+ --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/8-imdb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/9-sst2/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/10-dbpedia/saved_weights \
629
+ --data_dir CL_Benchmark \
630
+ --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
631
+ --task_config_dir configs/gen_script_long_order3_t5_configs/agnews \
632
+ --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/11-agnews \
633
+ --per_device_train_batch_size $BSZ \
634
+ --per_device_eval_batch_size $EVAL_BSZ \
635
+ --gradient_accumulation_steps $GA \
636
+ --learning_rate 0.0003 \
637
+ --num_train_epochs 10 \
638
+ --run_name gen_script_long_order3_t5_small_specroute_v3 \
639
+ --max_source_length 512 \
640
+ --max_target_length 50 \
641
+ --generation_max_length 50 \
642
+ --add_task_name False \
643
+ --add_dataset_name False \
644
+ --overwrite_output_dir \
645
+ --overwrite_cache \
646
+ --lr_scheduler_type constant \
647
+ --warmup_steps 0 \
648
+ --logging_strategy steps \
649
+ --logging_steps 10 \
650
+ --metric_for_best_model eval_exact_match_for_agnews \
651
+ --evaluation_strategy epoch \
652
+ --save_strategy epoch \
653
+ --save_total_limit 1 \
654
+ --load_best_model_at_end \
655
+ --lora_r 8 \
656
+ --lora_alpha 32 \
657
+ --lora_dropout 0.0 \
658
+ --data_replay_freq -1 \
659
+ --mlp_hidden_dim 100 \
660
+ --model_name specroute \
661
+ --target_routing_alpha 0.8 \
662
+ --gen_data_dir CL_Benchmark \
663
+ --threshold 0.995 \
664
+ --transthreshold 0.995 \
665
+ $FP16_FLAG
666
+
667
+ rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/11-agnews/checkpoint*
668
+
669
+ sleep 5
670
+
671
+ if [ "$GPU_MODE" = "t4_2gpu" ]; then
672
+ BSZ=4; GA=4; EVAL_BSZ=128
673
+ elif [ "$GPU_MODE" = "t4_1gpu" ]; then
674
+ BSZ=8; GA=4; EVAL_BSZ=128
675
+ else
676
+ BSZ=32; GA=1; EVAL_BSZ=128
677
+ fi
678
+
679
+ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
680
+ --do_train \
681
+ --do_predict \
682
+ --predict_with_generate \
683
+ --model_name_or_path $2 \
684
+ --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/8-imdb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/9-sst2/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/10-dbpedia/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/11-agnews/saved_weights \
685
+ --data_dir CL_Benchmark \
686
+ --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
687
+ --task_config_dir configs/gen_script_long_order3_t5_configs/yahoo \
688
+ --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/12-yahoo \
689
+ --per_device_train_batch_size $BSZ \
690
+ --per_device_eval_batch_size $EVAL_BSZ \
691
+ --gradient_accumulation_steps $GA \
692
+ --learning_rate 0.0003 \
693
+ --num_train_epochs 10 \
694
+ --run_name gen_script_long_order3_t5_small_specroute_v3 \
695
+ --max_source_length 512 \
696
+ --max_target_length 50 \
697
+ --generation_max_length 50 \
698
+ --add_task_name False \
699
+ --add_dataset_name False \
700
+ --overwrite_output_dir \
701
+ --overwrite_cache \
702
+ --lr_scheduler_type constant \
703
+ --warmup_steps 0 \
704
+ --logging_strategy steps \
705
+ --logging_steps 10 \
706
+ --metric_for_best_model eval_exact_match_for_yahoo \
707
+ --evaluation_strategy epoch \
708
+ --save_strategy epoch \
709
+ --save_total_limit 1 \
710
+ --load_best_model_at_end \
711
+ --lora_r 8 \
712
+ --lora_alpha 32 \
713
+ --lora_dropout 0.0 \
714
+ --data_replay_freq -1 \
715
+ --mlp_hidden_dim 100 \
716
+ --model_name specroute \
717
+ --target_routing_alpha 0.8 \
718
+ --gen_data_dir CL_Benchmark \
719
+ --threshold 0.995 \
720
+ --transthreshold 0.995 \
721
+ $FP16_FLAG
722
+
723
+ rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/12-yahoo/checkpoint*
724
+
725
+ sleep 5
726
+
727
+ if [ "$GPU_MODE" = "t4_2gpu" ]; then
728
+ BSZ=4; GA=4; EVAL_BSZ=128
729
+ elif [ "$GPU_MODE" = "t4_1gpu" ]; then
730
+ BSZ=8; GA=4; EVAL_BSZ=128
731
+ else
732
+ BSZ=32; GA=1; EVAL_BSZ=128
733
+ fi
734
+
735
+ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
736
+ --do_train \
737
+ --do_predict \
738
+ --predict_with_generate \
739
+ --model_name_or_path $2 \
740
+ --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/8-imdb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/9-sst2/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/10-dbpedia/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/11-agnews/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/12-yahoo/saved_weights \
741
+ --data_dir CL_Benchmark \
742
+ --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
743
+ --task_config_dir configs/gen_script_long_order3_t5_configs/multirc \
744
+ --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/13-multirc \
745
+ --per_device_train_batch_size $BSZ \
746
+ --per_device_eval_batch_size $EVAL_BSZ \
747
+ --gradient_accumulation_steps $GA \
748
+ --learning_rate 0.0003 \
749
+ --num_train_epochs 10 \
750
+ --run_name gen_script_long_order3_t5_small_specroute_v3 \
751
+ --max_source_length 512 \
752
+ --max_target_length 50 \
753
+ --generation_max_length 50 \
754
+ --add_task_name False \
755
+ --add_dataset_name False \
756
+ --overwrite_output_dir \
757
+ --overwrite_cache \
758
+ --lr_scheduler_type constant \
759
+ --warmup_steps 0 \
760
+ --logging_strategy steps \
761
+ --logging_steps 10 \
762
+ --metric_for_best_model eval_exact_match_for_multirc \
763
+ --evaluation_strategy epoch \
764
+ --save_strategy epoch \
765
+ --save_total_limit 1 \
766
+ --load_best_model_at_end \
767
+ --lora_r 8 \
768
+ --lora_alpha 32 \
769
+ --lora_dropout 0.0 \
770
+ --data_replay_freq -1 \
771
+ --mlp_hidden_dim 100 \
772
+ --model_name specroute \
773
+ --target_routing_alpha 0.8 \
774
+ --gen_data_dir CL_Benchmark \
775
+ --threshold 0.995 \
776
+ --transthreshold 0.995 \
777
+ $FP16_FLAG
778
+
779
+ rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/13-multirc/checkpoint*
780
+
781
+ sleep 5
782
+
783
+ if [ "$GPU_MODE" = "t4_2gpu" ]; then
784
+ BSZ=4; GA=4; EVAL_BSZ=128
785
+ elif [ "$GPU_MODE" = "t4_1gpu" ]; then
786
+ BSZ=8; GA=4; EVAL_BSZ=128
787
+ else
788
+ BSZ=32; GA=1; EVAL_BSZ=128
789
+ fi
790
+
791
+ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
792
+ --do_train \
793
+ --do_predict \
794
+ --predict_with_generate \
795
+ --model_name_or_path $2 \
796
+ --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/8-imdb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/9-sst2/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/10-dbpedia/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/11-agnews/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/12-yahoo/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/13-multirc/saved_weights \
797
+ --data_dir CL_Benchmark \
798
+ --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
799
+ --task_config_dir configs/gen_script_long_order3_t5_configs/boolq \
800
+ --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/14-boolq \
801
+ --per_device_train_batch_size $BSZ \
802
+ --per_device_eval_batch_size $EVAL_BSZ \
803
+ --gradient_accumulation_steps $GA \
804
+ --learning_rate 0.0003 \
805
+ --num_train_epochs 10 \
806
+ --run_name gen_script_long_order3_t5_small_specroute_v3 \
807
+ --max_source_length 512 \
808
+ --max_target_length 50 \
809
+ --generation_max_length 50 \
810
+ --add_task_name False \
811
+ --add_dataset_name False \
812
+ --overwrite_output_dir \
813
+ --overwrite_cache \
814
+ --lr_scheduler_type constant \
815
+ --warmup_steps 0 \
816
+ --logging_strategy steps \
817
+ --logging_steps 10 \
818
+ --metric_for_best_model eval_exact_match_for_boolq \
819
+ --evaluation_strategy epoch \
820
+ --save_strategy epoch \
821
+ --save_total_limit 1 \
822
+ --load_best_model_at_end \
823
+ --lora_r 8 \
824
+ --lora_alpha 32 \
825
+ --lora_dropout 0.0 \
826
+ --data_replay_freq -1 \
827
+ --mlp_hidden_dim 100 \
828
+ --model_name specroute \
829
+ --target_routing_alpha 0.8 \
830
+ --gen_data_dir CL_Benchmark \
831
+ --threshold 0.995 \
832
+ --transthreshold 0.995 \
833
+ $FP16_FLAG
834
+
835
+ rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/14-boolq/checkpoint*
836
+
837
+ sleep 5
838
+
839
+ if [ "$GPU_MODE" = "t4_2gpu" ]; then
840
+ BSZ=4; GA=4; EVAL_BSZ=128
841
+ elif [ "$GPU_MODE" = "t4_1gpu" ]; then
842
+ BSZ=8; GA=4; EVAL_BSZ=128
843
+ else
844
+ BSZ=32; GA=1; EVAL_BSZ=128
845
+ fi
846
+
847
+ CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
848
+ --do_train \
849
+ --do_predict \
850
+ --predict_with_generate \
851
+ --model_name_or_path $2 \
852
+ --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/8-imdb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/9-sst2/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/10-dbpedia/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/11-agnews/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/12-yahoo/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/13-multirc/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/14-boolq/saved_weights \
853
+ --data_dir CL_Benchmark \
854
+ --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
855
+ --task_config_dir configs/gen_script_long_order3_t5_configs/wic \
856
+ --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/15-wic \
857
+ --per_device_train_batch_size $BSZ \
858
+ --per_device_eval_batch_size $EVAL_BSZ \
859
+ --gradient_accumulation_steps $GA \
860
+ --learning_rate 0.0003 \
861
+ --num_train_epochs 10 \
862
+ --run_name gen_script_long_order3_t5_small_specroute_v3 \
863
+ --max_source_length 512 \
864
+ --max_target_length 50 \
865
+ --generation_max_length 50 \
866
+ --add_task_name False \
867
+ --add_dataset_name False \
868
+ --overwrite_output_dir \
869
+ --overwrite_cache \
870
+ --lr_scheduler_type constant \
871
+ --warmup_steps 0 \
872
+ --logging_strategy steps \
873
+ --logging_steps 10 \
874
+ --metric_for_best_model eval_exact_match_for_wic \
875
+ --evaluation_strategy epoch \
876
+ --save_strategy epoch \
877
+ --save_total_limit 1 \
878
+ --load_best_model_at_end \
879
+ --lora_r 8 \
880
+ --lora_alpha 32 \
881
+ --lora_dropout 0.0 \
882
+ --data_replay_freq -1 \
883
+ --mlp_hidden_dim 100 \
884
+ --model_name specroute \
885
+ --target_routing_alpha 0.8 \
886
+ --gen_data_dir CL_Benchmark \
887
+ --threshold 0.995 \
888
+ --transthreshold 0.995 \
889
+ $FP16_FLAG
890
+
891
+ rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/15-wic/checkpoint*
892
+
893
+ sleep 5
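The three `GPU_MODE` branches repeated throughout the script are meant to keep the effective batch size at 32. A minimal sketch of that arithmetic follows; the GPU count per mode and the names `BatchConfig`, `CONFIGS` and `config_for` are assumptions made for illustration and do not appear in the script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchConfig:
    per_device_batch: int  # BSZ in the script
    grad_accum: int        # GA in the script
    n_gpus: int            # assumed device count for the mode

    @property
    def effective_batch(self) -> int:
        # Samples contributing to one optimizer step.
        return self.per_device_batch * self.grad_accum * self.n_gpus


# Values copied from the if/elif/else block; n_gpus is an assumption.
CONFIGS = {
    "t4_2gpu": BatchConfig(per_device_batch=4, grad_accum=4, n_gpus=2),
    "t4_1gpu": BatchConfig(per_device_batch=8, grad_accum=4, n_gpus=1),
    "default": BatchConfig(per_device_batch=32, grad_accum=1, n_gpus=1),
}


def config_for(gpu_mode: str) -> BatchConfig:
    """Mirror the shell logic: unknown modes fall through to the else branch."""
    return CONFIGS.get(gpu_mode, CONFIGS["default"])


if __name__ == "__main__":
    for mode in ("t4_2gpu", "t4_1gpu", "a100"):
        cfg = config_for(mode)
        print(mode, cfg.effective_batch)
```

Under these assumptions every mode yields the same effective batch size, which is what makes results comparable across hardware.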
improve_gainlora/src/run_t5.py CHANGED
@@ -164,6 +164,14 @@ class ModelArguments:
164
  },
165
  )
166

167
  run_single: bool = field(
168
  default=False,
169
  metadata={
@@ -489,6 +497,7 @@ def main():
489
  'lora_alpha': model_args.lora_alpha,
490
  'lora_dropout': model_args.lora_dropout,
491
  'training_bias': model_args.training_bias,
 
492
  }
493
 
494
  if training_args.model_name in ['inflora', 'olora']:
@@ -1021,6 +1030,10 @@ def main():
1021
 
1022
  trainer.model.encoder.is_inference = True
1023

1024
  # Collect attention weights for GainLoRA KL replay (not needed for SpecRoute)
1025
  if training_args.model_name != 'specroute':
1026
  _ = trainer.predict(
 
164
  },
165
  )
166
 
+    target_routing_alpha: Optional[float] = field(
+        default=0.8,
+        metadata={
+            "help": "Target softmax routing weight for current task during training (SpecRoute V3). "
+            "Adaptive bias = T*ln(alpha*n_old/(1-alpha)). Set 0 to use fixed training_bias."
+        },
+    )
+
175
  run_single: bool = field(
176
  default=False,
177
  metadata={
 
497
  'lora_alpha': model_args.lora_alpha,
498
  'lora_dropout': model_args.lora_dropout,
499
  'training_bias': model_args.training_bias,
500
+ 'target_routing_alpha': model_args.target_routing_alpha,
501
  }
502
 
503
  if training_args.model_name in ['inflora', 'olora']:
 
1030
 
1031
  trainer.model.encoder.is_inference = True
1032
 
1033
+ # SpecRoute V3: precompute SVD of current task's LoRA for symmetric inference routing
1034
+ if training_args.model_name == 'specroute' and hasattr(trainer.model.encoder, 'prepare_inference_routing'):
1035
+ trainer.model.encoder.prepare_inference_routing()
1036
+
1037
  # Collect attention weights for GainLoRA KL replay (not needed for SpecRoute)
1038
  if training_args.model_name != 'specroute':
1039
  _ = trainer.predict(
improve_gainlora/src/t5_specroute.py CHANGED
@@ -183,10 +183,16 @@ class T5Stack(T5PreTrainedModel):
183
  # Spectral signatures loaded from previous tasks' saved weights
184
  self.spectral_signatures = [] # List[dict] — one dict per old task
185
  self.routing_temperature = prompt_config.get('attn_temperature', 1.0)
-        # Training bias: ensures current task gets adequate routing weight
-        # during training (compensates for A-based fit being lower than
-        # SVD-based fits of established old tasks). Set to 0 at inference.
-        self.training_bias = prompt_config.get('training_bias', 1.0)
190
 
191
  # For inference logging
192
  self.all_attn_weights = []
@@ -210,8 +216,13 @@ class T5Stack(T5PreTrainedModel):
210
  """
211
  Compute routing weights using spectral projection fits.
212
 
-        For each task, measures how much of the input's energy falls into that
-        task's operating subspace (defined by SVD of its LoRA weights).
215
 
216
  Args:
217
  avg_inputs_embeds: (B, 1, d_model) — averaged input token embeddings
@@ -224,29 +235,64 @@ class T5Stack(T5PreTrainedModel):
224
 
225
  fits = []
226
 
-        # 1. Current task fit: use A rows directly (not SVD of B@A).
-        # Motivation: At training start B=0, so SVD(B@A) gives σ≈0 → fit≈0 →
-        # routing weight≈0 → gradient≈0 → B can't learn (cold-start problem).
-        # A rows define the input subspace available to the current task
-        # (post null-space projection). This gives a stable, non-zero signal
-        # from initialization, measuring "how much of h's energy falls into
-        # the current task's available input subspace".
-        # Formula: fit_cur(h) = Σ_i (a_i · h)² / (r · ||h||²)
-        current_fits_layers = []
-        for block in self.block:
-            attn = block.layer[0].SelfAttention
-            for lora in [attn.lora_q, attn.lora_v]:
-                A = lora.lora_A.data.float()  # (r, d_model), frozen
-                r = lora.r
-                A_h = A.to(h.device, dtype=h.dtype)
-                proj = torch.matmul(h, A_h.T)  # (B, 1, r)
-                fit = (proj ** 2).sum(dim=-1) / (r * h_norm_sq)  # (B, 1)
-                current_fits_layers.append(fit)
-        current_fit = torch.stack(current_fits_layers).mean(dim=0)  # (B, 1)
-        # Training bias: boost current task during training (β>0 during train, 0 at inference)
-        if self.training and hasattr(self, 'training_bias'):
-            current_fit = current_fit + self.training_bias
-        fits.append(current_fit)
250
 
251
  # 2. Previous tasks fit: use spectral signatures (V, sigma)
252
  for sig_dict in self.spectral_signatures:
@@ -278,6 +324,31 @@ class T5Stack(T5PreTrainedModel):
278
 
279
  return weights.unsqueeze(2) # (B, n_tasks, 1)
280

281
  def parallelize(self, device_map=None):
282
  self.device_map = (
283
  get_device_map(len(self.block), range(torch.cuda.device_count()))
 
183
  # Spectral signatures loaded from previous tasks' saved weights
184
  self.spectral_signatures = [] # List[dict] — one dict per old task
185
  self.routing_temperature = prompt_config.get('attn_temperature', 1.0)
+
+        # Adaptive training bias: β = T·ln(α·n_old/(1−α))
+        # Ensures current task gets consistent routing weight regardless
+        # of total number of tasks (fixes softmax dilution with constant bias).
+        # At inference, uses symmetric SVD-based routing instead (no bias needed).
+        self._target_routing_alpha = prompt_config.get('target_routing_alpha', 0.8)
+
+        # Precomputed SVD of current task's LoRA for symmetric inference routing.
+        # Populated by prepare_inference_routing() before prediction.
+        self._current_task_svd = None
196
 
197
  # For inference logging
198
  self.all_attn_weights = []
 
216
  """
217
  Compute routing weights using spectral projection fits.
218
 
219
+ Training mode: A-row fit + adaptive bias for current task (cold-start safe).
220
+ Inference mode: SVD-based fit for ALL tasks (symmetric, no bias needed).
221
+
222
+ The adaptive bias β = T·ln(α·n_old/(1−α)) ensures the current task gets
223
+ routing weight ≈ α regardless of the number of previous tasks, compensating
224
+ for softmax dilution. At inference, all tasks use the same σ²-weighted
225
+ Rayleigh quotient (SVD fit), eliminating the train-test asymmetry.
226
 
227
  Args:
228
  avg_inputs_embeds: (B, 1, d_model) — averaged input token embeddings
 
235
 
236
  fits = []
237
 
+        if self.training:
+            # === TRAINING MODE: A-row fit + adaptive bias for current task ===
+            # A rows give stable non-zero signal even when B=0 (cold-start).
+            current_fits_layers = []
+            for block in self.block:
+                attn = block.layer[0].SelfAttention
+                for lora in [attn.lora_q, attn.lora_v]:
+                    A = lora.lora_A.data.float()  # (r, d_model), frozen
+                    r = lora.r
+                    A_h = A.to(h.device, dtype=h.dtype)
+                    proj = torch.matmul(h, A_h.T)  # (B, 1, r)
+                    fit = (proj ** 2).sum(dim=-1) / (r * h_norm_sq)  # (B, 1)
+                    current_fits_layers.append(fit)
+            current_fit = torch.stack(current_fits_layers).mean(dim=0)  # (B, 1)
+
+            # Adaptive training bias: β = T·ln(α·n_old/(1−α))
+            n_old = len(self.spectral_signatures)
+            if n_old > 0:
+                alpha = self._target_routing_alpha
+                beta = self.routing_temperature * math.log(alpha * n_old / (1.0 - alpha))
+                current_fit = current_fit + beta
+            fits.append(current_fit)
+        else:
+            # === INFERENCE MODE: SVD-based fit for current task (symmetric) ===
+            # After training, B≠0 so SVD(B@A) gives meaningful signatures.
+            # Using the same σ²-weighted formula as old tasks eliminates the
+            # A-row vs SVD measurement asymmetry.
+            if self._current_task_svd is not None:
+                cur_fits = []
+                for key, sig_data in self._current_task_svd.items():
+                    if not key.startswith('enc.'):
+                        continue
+                    V = sig_data['V'].to(h.device, dtype=h.dtype)
+                    sigma = sig_data['sigma'].to(h.device, dtype=h.dtype)
+                    proj = torch.matmul(h, V.T)
+                    sigma_sq = sigma ** 2
+                    sigma_sq_sum = sigma_sq.sum() + 1e-8
+                    weighted_proj = (proj ** 2 * sigma_sq.unsqueeze(0).unsqueeze(0)).sum(dim=-1)
+                    fit = weighted_proj / (sigma_sq_sum * h_norm_sq)
+                    cur_fits.append(fit)
+                if cur_fits:
+                    current_fit = torch.stack(cur_fits).mean(dim=0)
+                else:
+                    current_fit = torch.zeros(h.shape[0], 1, device=h.device, dtype=h.dtype)
+            else:
+                # Fallback: A-row fit without bias (backward compat)
+                current_fits_layers = []
+                for block in self.block:
+                    attn = block.layer[0].SelfAttention
+                    for lora in [attn.lora_q, attn.lora_v]:
+                        A = lora.lora_A.data.float()
+                        r = lora.r
+                        A_h = A.to(h.device, dtype=h.dtype)
+                        proj = torch.matmul(h, A_h.T)
+                        fit = (proj ** 2).sum(dim=-1) / (r * h_norm_sq)
+                        current_fits_layers.append(fit)
+                current_fit = torch.stack(current_fits_layers).mean(dim=0)
+            fits.append(current_fit)
296
 
297
  # 2. Previous tasks fit: use spectral signatures (V, sigma)
298
  for sig_dict in self.spectral_signatures:
 
324
 
325
  return weights.unsqueeze(2) # (B, n_tasks, 1)
326
 
327
+ def prepare_inference_routing(self):
328
+ """
329
+ Precompute SVD of current task's LoRA for symmetric inference routing.
330
+
331
+ Must be called after training (B≠0) and before prediction. Computes
332
+ spectral signatures from the current (most recently trained) task's
333
+ LoRA weights, enabling the same σ²-weighted Rayleigh quotient formula
334
+ for all tasks during inference.
335
+ """
336
+ sigs = {}
337
+ with torch.no_grad():
338
+ for j, block in enumerate(self.block):
339
+ attn = block.layer[0].SelfAttention
340
+ for name, lora in [('q', attn.lora_q), ('v', attn.lora_v)]:
341
+ A = lora.lora_A.data.float()
342
+ B = lora.lora_B.data.float()
343
+ r = lora.r
344
+ S, Vt = _thin_svd_low_rank(B, A)
345
+ sigs[f'enc.{j}.self.{name}'] = {
346
+ 'V': Vt[:r].cpu(),
347
+ 'sigma': S[:r].cpu()
348
+ }
349
+ self._current_task_svd = sigs
350
+ print(f"[SpecRoute] Prepared inference routing: {len(sigs)} encoder SVD signatures for current task")
351
+
352
  def parallelize(self, device_map=None):
353
  self.device_map = (
354
  get_device_map(len(self.block), range(torch.cuda.device_count()))
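The two routing scores used in this file can be compared on a toy input. The sketch below restates both in plain Python so the arithmetic is easy to follow; it is not the repository's tensor code. The helper names `dot`, `a_row_fit` and `svd_fit` and the example vectors are assumptions made for illustration.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def a_row_fit(h: Vector, a_rows: Sequence[Vector]) -> float:
    """Uniformly weighted fit: sum_i (a_i . h)^2 / (r * ||h||^2)."""
    r = len(a_rows)
    return sum(dot(a, h) ** 2 for a in a_rows) / (r * dot(h, h))


def svd_fit(h: Vector, v_rows: Sequence[Vector], sigma: Vector,
            eps: float = 1e-8) -> float:
    """sigma^2-weighted fit: sum_i s_i^2 (v_i . h)^2 / ((sum_i s_i^2 + eps) * ||h||^2)."""
    sigma_sq = [s * s for s in sigma]
    weighted = sum(s2 * dot(v, h) ** 2 for s2, v in zip(sigma_sq, v_rows))
    return weighted / ((sum(sigma_sq) + eps) * dot(h, h))


if __name__ == "__main__":
    basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # toy rank-2 subspace
    sigma = [2.0, 1.0]                          # toy singular values
    h = [1.0, 0.0, 0.0]                         # input aligned with the top direction
    print(a_row_fit(h, basis), svd_fit(h, basis, sigma))
```

For an input aligned with the dominant direction, the σ²-weighted score is higher than the uniformly weighted one. This is the asymmetry the commit removes by scoring every task with the same formula at inference.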
results/experiment_versions.md CHANGED
@@ -166,19 +166,110 @@ ROOT GainLoRA giải quyết vấn đề này nhờ trans_input MLP map input m
166
  - threshold/transthreshold: 0.980 (kept from previous)
167
 
 ### Results
- > *Not yet run; experiments needed*

 ### Expectations
- - Cold-start fix → later tasks (8+) receive a large enough routing weight (>30%) → B can learn
- - Threshold 0.980 → sentiment tasks (imdb, sst2) have null-space capacity for learning
- - Training bias β=1.0 → the current task dominates routing during training, with no effect on inference
- - Fair BSZ (effective=32 = ROOT) → direct AP comparison
- - Overall AP: expected >50 (V1=39.74), with the goal of approaching ROOT=59.70

- ### If the results fall short → V3 Plan
- - **V3a**: Adaptive training bias β per task (higher for later tasks, when the null space is smaller)
- - **V3b**: Adaptive threshold per layer (layers near the output need a lower threshold)
- - **V3c**: Warm-start: initialize B_new from a weighted combination of old B vectors (zero-replay compliant)
182
 
183
  ---
184
 
@@ -238,3 +329,5 @@ V2 đã tắt replay (`data_replay_freq=-1`), match ROOT. Runtime ước tính n
 | 2025-XX-XX | V2.0 (cancelled) | ~~Replay~~ | ~~Add experience replay~~: **CANCELLED** for violating the zero-replay constraint |
 | 2025-XX-XX | V2.0 | Bug fix + Fair | A-row routing (fix cold-start), training bias β=1.0, threshold 0.980, fair BSZ=32 |
 | 2025-XX-XX | V2.1 | Perf Optimization | Thin QR+SVD (~186× speedup per SVD, zero accuracy loss) |

166
  - threshold/transthreshold: 0.980 (kept from previous)
167
 
 ### Results
+
+ | # | Task | ROOT EM | V2 EM | Δ | Notes |
+ |---|------|---------|-------|---|-------|
+ | 1 | yelp | 56.01 | 35.91 | -20.10 | Below |
+ | 2 | amazon | 52.05 | 36.58 | -15.47 | Below |
+ | 3 | mnli | 34.07 | 0.25 | -33.82 | Catastrophic forgetting (peak 31.25 ep8) |
+ | 4 | cb | 3.57 | 0.00 | -3.57 | EM=0: misrouted garbage output |
+ | 5 | copa | 42.00 | **47.00** | **+5.00** | ✅ Better |
+ | 6 | qqp | 76.96 | **77.03** | **+0.07** | ✅ Tie |
+ | 7 | rte | 45.85 | 0.36 | -45.49 | Catastrophic forgetting (peak 51.26 ep4) |
+ | 8 | imdb | 89.51 | 0.00 | -89.51 | ❌ EM=0: pred "positive"/"negative" vs label "Good"/"Bad" |
+ | 9 | sst2 | 85.21 | 0.00 | -85.21 | ❌ EM=0: pred "negative" vs label "Bad" |
+ | 10 | dbpedia | 98.16 | 71.95 | -26.21 | Below |
+ | 11 | agnews | 88.37 | 68.21 | -20.16 | Below |
+ | 12 | yahoo | 57.28 | 6.82 | -50.46 | Very low |
+ | 13 | multirc | 50.52 | **55.42** | **+4.90** | ✅ Better |
+ | 14 | boolq | 60.43 | **61.44** | **+1.01** | ✅ Better |
+ | 15 | wic | 55.49 | 0.00 | -55.49 | ❌ EM=0: pred "the same meaning" vs label "True" |
+ | | **AP(EM)** | **59.70** | **30.73** | **-28.97** | |
+ | | **AP(rougeL)** | **61.66** | **38.00** | **-23.66** | |
+
+ ### Detailed analysis
+
+ **Group 1: EM=0 due to MISROUTING (4 tasks)**
+ - imdb predicts "positive"/"negative", which is the label vocabulary of yelp/amazon → routing sends imdb inputs to the old LoRAs
+ - sst2 predicts "negative" → same pattern, routed to the yelp LoRA
+ - wic predicts "the same meaning"/"different" while the correct labels are "True"/"False" → routed to the wrong expert
+ - cb predicts gibberish ("bedroom", "virtuous") → completely misrouted
+
+ **Group 2: Catastrophic forgetting (2 tasks)**
+ - mnli: EM peaks at 31.25 at ep8, but the final value is 0.25 → degenerate (always "neutral")
+ - rte: EM peaks at 51.26 at ep4, final 0.36 → overfits, then collapses
+
+ **ROOT CAUSE: the constant β=1.0 does not scale with the number of tasks**
+
+ | n_tasks | Training w_cur | Inference w_cur | Gap (training ÷ inference) |
+ |---------|---------------|----------------|-----|
+ | 1 | 100% | 100% | 1.0x |
+ | 2 | 71.5% | 48.0% | 1.5x |
+ | 8 (imdb) | **26.4%** | **11.7%** | 2.3x |
+ | 15 (wic) | **15.2%** | **6.2%** | 2.5x |
+
+ Task 8 (imdb) receives only 26.4% of the routing weight during training → 73.6% of the gradient flows through the old LoRAs → the model learns the label vocabulary of old tasks instead of the current task.
+
+ **SECONDARY CAUSE: A-row fit vs SVD fit asymmetry**
+ - Training: the current task uses the A-row fit (uniform weighting)
+ - Inference: the current task STILL uses the A-row fit (without the bias), while old tasks use the SVD fit (σ²-weighted)
+ - The SVD fit is systematically higher than the A-row fit → old tasks always win routing at inference
+
+ ### Expectations
+ - Cold-start fix → resolves EM=0 on tasks 1-3 ✅
+ - Training bias β=1.0 → sufficient only for ≤3 tasks, NOT for 8+ tasks ❌
+
+ ---
+
224
+ ## Version 3.0 — SpecRoute V3: Adaptive Bias + Symmetric Inference Routing
+
+ ### Methodology changes (SPECROUTE_IDEA.md UPDATED)
+
+ **1. Adaptive Training Bias (replaces the constant β=1.0):**
+
+ $$\beta(n) = \tau \cdot \ln\!\left(\frac{\alpha_{\mathrm{target}} \cdot n}{1 - \alpha_{\mathrm{target}}}\right)$$
+
+ - $n$ = number of old tasks = `len(spectral_signatures)` (so the total task count, including the current task, is $n + 1$)
+ - $\alpha_{\mathrm{target}}$ = target routing weight (default 0.8)
+ - Keeps w_cur ≈ 80% regardless of the total number of tasks
+ - Derivation by solving the softmax equation: see SPECROUTE_IDEA.md Section C2
+
+ **2. Symmetric Inference Routing (replaces the A-row fit at inference):**
+ - After training, B≠0 → SVD(B@A) yields meaningful signatures
+ - Call `prepare_inference_routing()` before prediction
+ - Inference: ALL tasks (including the current one) use the same σ²-weighted Rayleigh quotient
+ - Completely removes the A-row vs SVD asymmetry → measurement symmetry
+
+ **3. Threshold 0.995 (matches ROOT, instead of 0.980):**
+ - Preserves null-space capacity for later tasks
+ - Capacity: d/(r·(1-ε)) = 512/(8·0.005) = 12,800 tasks (far more than needed)
+
+ ### Code Changes
+
+ **`t5_specroute.py`:**
+ - `compute_spectral_routing()`:
+   - Training: A-row fit + β(n), computed automatically from `len(spectral_signatures)`
+   - Inference: uses `_current_task_svd` (SVD-based fit) for the current task
+ - Add `prepare_inference_routing()`: computes SVD(B@A) for the current task's LoRA
+ - Add the `_target_routing_alpha` config parameter
+ - Remove the fixed `training_bias`
+
+ **`run_t5.py`:**
+ - Add the `target_routing_alpha` argument (default 0.8)
+ - Call `model.encoder.prepare_inference_routing()` before inference
+
+ **Shell script: `T5_small/gen_script_long_order3_t5_small_specroute_v3.sh`:**
+ - `--target_routing_alpha 0.8` (replaces `--training_bias 1.0`)
+ - `--threshold 0.995` (replaces 0.980)
+ - `--transthreshold 0.995` (replaces 0.980)

  ### Expectations
+ - Adaptive bias → tasks 8+ receive ≈80% of the routing weight → they can learn the correct label vocabulary
+ - Symmetric inference → more accurate routing at eval → EM>0 for imdb/sst2/wic
+ - Threshold 0.995 → better protection + larger routing margin (Theorem 1)

+ ### Results
+ > *Not yet run; experiments needed*

  ---

  | 2025-XX-XX | V2.0 (cancelled) | ~~Replay~~ | ~~Add experience replay~~: **CANCELLED** for violating the zero-replay constraint |
  | 2025-XX-XX | V2.0 | Bug fix + Fair | A-row routing (fix cold-start), training bias β=1.0, threshold 0.980, fair BSZ=32 |
  | 2025-XX-XX | V2.1 | Perf Optimization | Thin QR+SVD (~186× speedup per SVD, zero accuracy loss) |
+ | 2026-03-17 | V2.0 | **Results** | AP(EM)=30.73 vs ROOT=59.70. 4 tasks with EM=0 (imdb/sst2/wic/cb misrouting), 2 with catastrophic forgetting |
+ | 2026-03-17 | V3.0 | **Methodology** | Adaptive bias β(n)=τ·ln(α·n/(1-α)), symmetric SVD inference routing, threshold→0.995 |
results/specroute_v2_diagnosis.md ADDED
@@ -0,0 +1,174 @@
+ # SpecRoute V2 → V3: Comprehensive Diagnosis & Fix Plan
+
+ ## 1. V2 Results Overview
+
+ | Metric | SpecRoute V2 | ROOT (GainLoRA-InfLoRA) | Gap |
+ |--------|-------------|------------------------|-----|
+ | AP(EM) | 30.73 | 59.70 | **-28.97** |
+ | AP(rougeL) | 38.00 | 61.66 | **-23.66** |
+
11
+
12
+ **Nhóm 1: EM = 0 suốt quá trình training (LABEL FORMAT MISMATCH)**
13
+ | Task | Pos | Prediction | Ground Truth | Nguyên nhân |
14
+ |------|-----|-----------|-------------|-------------|
15
+ | imdb | 8 | "positive"/"negative" | "Good"/"Bad" | Misrouted → yelp/amazon LoRA |
16
+ | sst2 | 9 | "negative" | "Bad" | Misrouted → yelp/amazon LoRA |
17
+ | wic | 15 | "the same meaning"/"different" | "True"/"False" | Misrouted → unknown LoRA |
18
+ | cb | 4 | "bedroom"/"yes"/"virtuous"/gibberish | "entailment"/"neutral"/"contradiction" | Misrouted → unrelated LoRA |
19
+
20
+ **Nhóm 2: Catastrophic Forgetting**
21
+ | Task | Best EM | Final EM | Nguyên nhân |
22
+ |------|---------|----------|-------------|
23
+ | mnli | 31.25 (ep8) | 0.25 | Degenerate → always "neutral" |
24
+ | rte | 51.26 (ep4) | 0.36 | Degenerate → always "entailment" |
25
+
26
+ **Nhóm 3: Hoạt động bình thường**
27
+ | Task | SpecRoute | ROOT | Status |
28
+ |------|-----------|------|--------|
29
+ | copa | **47.00** | 42.00 | ✅ Better (+5.0) |
30
+ | multirc | **55.42** | 50.52 | ✅ Better (+4.9) |
31
+ | boolq | **61.44** | 60.43 | ✅ Better (+1.0) |
32
+ | qqp | **77.03** | 76.96 | ✅ Tie |
33
+
34
+ ---
35
+
36
+ ## 2. Chẩn đoán nguyên nhân gốc (Root Cause Analysis)
37
+
38
+ ### 2.1 BUG CHỦ YẾU: Constant Training Bias Không Scale Theo Số Task
39
+
40
+ **Công thức hiện tại:**
41
+ $$\text{fit}_{\text{cur}} = \frac{1}{L}\sum_\ell \frac{\sum_i (a_i \cdot h)^2}{r \|h\|^2} + \beta \quad (\beta = 1.0 \text{ cố định})$$
42
+
43
+ **Softmax routing weight cho current task:**
44
+ $$w_{\text{cur}} = \frac{e^{(\text{fit}_{\text{cur}})/T}}{e^{(\text{fit}_{\text{cur}})/T} + (n-1) \cdot e^{\text{fit}_{\text{old}}/T}}$$
45
+
46
+ Với $\beta = 1.0$, $T = 1.0$, fit_raw ≈ 0.12, fit_old ≈ 0.20:
47
+
48
+ | n_tasks | Training $w_{\text{cur}}$ | Inference $w_{\text{cur}}$ |
49
+ |---------|--------------------------|---------------------------|
50
+ | 1 | 100% | 100% |
51
+ | 2 | 71.5% | 48.0% |
52
+ | 5 | 38.5% | 18.8% |
53
+ | **8 (imdb)** | **26.4%** | **11.7%** |
54
+ | 10 | 21.8% | 9.3% |
55
+ | **15 (wic)** | **15.2%** | **6.2%** |
56
+
57
+ **Hậu quả:**
58
+ - Task 8 (imdb): Chỉ 26.4% routing weight khi training → 73.6% gradient signal đi qua LoRA cũ → model không thể học label "Good"/"Bad"
59
+ - Task 15 (wic): Chỉ 15.2% routing weight → gần như không học được gì
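
The table above can be reproduced with a few lines of Python. This is a minimal sketch, not the repository's routing code: it assumes every old task has the same fit (0.20) and uses the approximate fit values quoted in the text. The helper name `current_task_weight` is hypothetical.

```python
import math


def current_task_weight(n_tasks, fit_cur, fit_old, temperature=1.0):
    """Softmax weight of the current task against (n_tasks - 1) old tasks
    that are all assumed to share the same fit value."""
    cur = math.exp(fit_cur / temperature)
    old = (n_tasks - 1) * math.exp(fit_old / temperature)
    return cur / (cur + old)


# Approximate values quoted in the text (assumed identical across old tasks).
FIT_RAW, FIT_OLD, BETA = 0.12, 0.20, 1.0

for n in (1, 2, 5, 8, 10, 15):
    train = current_task_weight(n, FIT_RAW + BETA, FIT_OLD)  # constant bias applied
    infer = current_task_weight(n, FIT_RAW, FIT_OLD)         # no bias at inference
    print(f"n={n:2d}  training={train:6.1%}  inference={infer:6.1%}")
```

Because the old-task term grows linearly with `n_tasks - 1` while the bias term stays fixed, the current task's share shrinks as tasks accumulate.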
+
+ **Comparison with ROOT:** ROOT uses an independent sigmoid per task: $w_k = |2\sigma(4\cos(x, \text{key}_k)) - 1|$. There is no zero-sum competition → each task can reach a weight of ~0.8 regardless of the number of tasks.
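
To make the contrast with softmax concrete, the sketch below evaluates the gate formula quoted above for a few made-up cosine similarities. The function name `root_gate` and the similarity values are illustrative assumptions; only the formula comes from the text. Since 2σ(4c) − 1 = tanh(2c), a cosine of about 0.55 already gives a gate of about 0.8.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def root_gate(cos_sim):
    """Independent per-task gate |2*sigmoid(4*cos) - 1| from the formula above."""
    return abs(2.0 * sigmoid(4.0 * cos_sim) - 1.0)


# Hypothetical cosine similarities between one input and each task key.
cosines = [0.55, 0.10, -0.05, 0.20]
gates = [root_gate(c) for c in cosines]

# Adding more tasks leaves the existing gates untouched (no normalisation).
more_gates = [root_gate(c) for c in cosines + [0.30, 0.40]]
first_gate_unchanged = more_gates[0] == gates[0]

print([round(g, 3) for g in gates], first_gate_unchanged)
```

The gates are not normalised to sum to 1, which is the property the softmax router lacks.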
+
+ ### 2.2 SECOND BUG: A-row fit vs SVD fit Asymmetry (Train-Test Gap)
+
+ **Training:** $\text{fit}_{\text{cur}} = \text{A-row fit} + \beta$
+ **Inference:** $\text{fit}_{\text{cur}} = \text{A-row fit}$ (no bias)
+
+ The two formulas measure fit on **two different scales**:
+ - **A-row fit** (current task): $\frac{\sum_i (a_i \cdot h)^2}{r \|h\|^2}$ (uniform weighting)
+ - **SVD fit** (old tasks): $\frac{\sum_i \sigma_i^2 (v_i \cdot h)^2}{\sum_i \sigma_i^2 \|h\|^2}$ ($\sigma^2$-weighted)
+
+ After null-space projection, the A rows are constrained to a narrow subspace → the A-row fit is **systematically lower** than the SVD fit → old tasks always win routing at inference.
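
The two fit definitions can be compared directly on a toy input. The following NumPy sketch uses hypothetical helper names (`a_row_fit`, `svd_fit`) and made-up vectors; it only illustrates that the same subspace and the same input score differently under uniform and σ²-weighted averaging, and it is not the code in `t5_specroute.py`.

```python
import numpy as np


def a_row_fit(A, h):
    """Uniformly weighted fit: sum_i (a_i . h)^2 / (r * ||h||^2). A is (r, d)."""
    r = A.shape[0]
    return float(np.sum((A @ h) ** 2) / (r * np.dot(h, h)))


def svd_fit(V, sigma, h):
    """sigma^2-weighted fit: sum_i s_i^2 (v_i . h)^2 / (sum_i s_i^2 * ||h||^2).
    V is (r, d) with orthonormal rows, sigma is (r,)."""
    w = sigma ** 2
    return float(np.sum(w * (V @ h) ** 2) / (np.sum(w) * np.dot(h, h)))


# Toy example (made-up numbers): two orthonormal directions in R^3
# and an input aligned with the first one.
V = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
h = np.array([1.0, 0.0, 0.0])

uniform = a_row_fit(V, h)                       # aligned direction counts for 1/r
weighted = svd_fit(V, np.array([3.0, 1.0]), h)  # dominant direction carries 9/10
print(uniform, weighted)
```

With identical directions, the uniform score is 0.5 and the weighted score is 0.9, so comparing one against the other inside a single softmax favours whichever side is σ²-weighted.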
+
+ ### 2.3 THIRD BUG: Threshold Too Low (0.980 vs ROOT 0.995)
+
+ - threshold = 0.980 → each task occupies **more** of the null-space
+ - After 7 tasks: the null-space left for task 8 is very narrow
+ - The rows of A_8 are projected into a small null-space → the A-row fit is extremely low
+ - ROOT uses 0.995 → each task occupies less null-space → capacity is maintained for later tasks
+
+ ---
+
+ ## 3. Theoretical Analysis (Theory-Backed)
+
+ ### 3.1 Softmax Competition Bias (Information Theory)
+
+ Softmax with a constant bias violates the **principle of maximum entropy** as the number of tasks grows. With $n$ tasks and similar fits, softmax converges to the uniform distribution $1/n$ (maximum entropy). A constant bias $\beta$ is not enough to counteract this entropy when $n$ is large.
+
+ **Adaptive bias derivation:** we want the current task to reach a fixed weight $\alpha$:
+ $$\alpha = \frac{e^{(f + \beta)/T}}{e^{(f + \beta)/T} + (n-1)e^{f/T}}$$
+
+ Solving for $\beta$:
+ $$\boxed{\beta = T \cdot \ln\left(\frac{\alpha(n-1)}{1-\alpha}\right)}$$
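
The intermediate algebra between the two equations above, written out as a sketch. It relies on the same assumption as the formula itself, namely that the current task and all $n-1$ old tasks share one raw fit $f$:

$$
\begin{aligned}
\frac{\alpha}{1-\alpha}
  &= \frac{e^{(f+\beta)/T}}{(n-1)\,e^{f/T}}
   = \frac{e^{\beta/T}}{n-1} \\
e^{\beta/T} &= \frac{\alpha\,(n-1)}{1-\alpha} \\
\beta &= T \ln\!\left(\frac{\alpha\,(n-1)}{1-\alpha}\right)
\end{aligned}
$$

The fit $f$ cancels in the first line, which is why $\beta$ depends only on $\alpha$, $n$, and $T$. When the fits are not equal, the target weight is met only approximately.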
+
+ With $\alpha = 0.8$, $T = 1.0$ ($n$ is the total number of tasks, including the current one):
+ - n=2: $\beta$ = 1.39 → w = 80%
+ - n=8: $\beta$ = 3.33 → w = 80%
+ - n=15: $\beta$ = 4.03 → w = 80%
+
+ **Connection to the literature**: similar to "bias correction" in the Adam optimizer, where the bias must change over time to maintain the desired statistical property.
+
+ ### 3.2 Rayleigh Quotient Symmetry (Linear Algebra)
+
+ The current fit formula violates **measurement symmetry**: the current task and the old tasks use different metrics. In Grassmannian geometry, the distance between two subspaces must be measured with one and the same metric.
+
+ **Weighted Rayleigh quotient** (the common standard for both):
+ $$\text{fit}_k(h) = \frac{\sum_{i=1}^r \sigma_{k,i}^2 (v_{k,i} \cdot h)^2}{\sum_{i=1}^r \sigma_{k,i}^2 \cdot \|h\|^2}$$
+
+ At inference, the current task must also use the SVD-based fit (the SVD is available because B ≠ 0 after training).
+
+ **Connection to the literature**: principal angle theory (Björck & Golub, 1973): distances between subspaces must be measured by canonical angles, equivalent to the $\sigma$-weighted Rayleigh quotient.
+
+ ### 3.3 Null-Space Capacity Bound (GPM Theory)
+
+ From GPM (Saha et al., 2021): with threshold $\tau$ (here $\tau$ is the GPM threshold, not the softmax temperature), each task occupies $\leq r(1-\tau)$ dimensions. Capacity for $n$ tasks:
+
+ $$n_{\max} = \left\lfloor \frac{d}{r(1-\tau)} \right\rfloor$$
+
+ With d=512, r=8:
+ - $\tau$ = 0.995 → capacity = 512/(8×0.005) = **12,800 tasks** (far more than needed)
+ - $\tau$ = 0.980 → capacity = 512/(8×0.020) = **3,200 tasks** (still ample, but more aggressive)
+
+ Threshold 0.995 preserves more capacity for later tasks.
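
The two capacity figures follow directly from the bound above. A small sketch of the arithmetic, with a hypothetical helper name `null_space_capacity`: it uses exact fractions because plain float arithmetic on values such as `1 - 0.995` can land just below the intended integer before the floor is taken.

```python
from fractions import Fraction
from math import floor


def null_space_capacity(d, r, threshold):
    """Upper bound n_max = floor(d / (r * (1 - threshold))) from the formula above."""
    # Parse the threshold from its decimal string so that 0.995 is exactly 199/200.
    tau = Fraction(str(threshold))
    return floor(d / (r * (1 - tau)))


for tau in (0.995, 0.980):
    print(tau, null_space_capacity(512, 8, tau))
```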
+
+ ---
+
+ ## 4. Fix Plan: SpecRoute V3
+
+ ### Fix 1: Adaptive Training Bias (Required)
+ ```
+ β = T · ln(α · (n_old) / (1-α)) when n_old ≥ 1
+ β = 0 when n_old = 0
+ ```
+ - `n_old = len(self.spectral_signatures)`: derived automatically from the number of loaded signatures
+ - `α = target_routing_alpha`: config parameter, default 0.8
+ - Keeps w_cur ≈ 80% regardless of the number of tasks
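
The rule above can be sketched as a standalone function. This is an illustration under the equal-fit assumption used in the derivation, not the `adaptive_training_bias` property in `t5_specroute.py`, whose body is not shown in this diff. The helper `current_weight` and the fit value 0.15 are hypothetical.

```python
import math


def adaptive_training_bias(n_old, target_alpha=0.8, temperature=1.0):
    """beta = T * ln(alpha * n_old / (1 - alpha)) for n_old >= 1, else 0."""
    if n_old < 1:
        return 0.0  # first task: there is nothing to compete with
    return temperature * math.log(target_alpha * n_old / (1.0 - target_alpha))


def current_weight(n_old, fit, beta, temperature=1.0):
    """Softmax weight of the current task when every task has the same raw fit."""
    cur = math.exp((fit + beta) / temperature)
    old = n_old * math.exp(fit / temperature)
    return cur / (cur + old)


for n_old in (1, 7, 14):  # total task counts 2, 8, 15
    beta = adaptive_training_bias(n_old)
    weight = current_weight(n_old, 0.15, beta)
    print(f"n_old={n_old:2d}  beta={beta:.2f}  w_cur={weight:.3f}")
```

The weight stays at the target for every task count because the bias grows with ln(n_old), exactly offsetting the growing number of competitors.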
+
+ ### Fix 2: Symmetric Inference Routing (Required)
+ - **Training**: keep the A-row fit + adaptive bias (cold-start compatible)
+ - **Inference**: compute SVD(B@A) for the current task → use the SVD fit for ALL tasks
+ - Method: `prepare_inference_routing()`, called once before inference
+ - Completely removes the A-row vs SVD asymmetry
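
One way the inference-time signature could be computed is sketched below in NumPy, following the thin QR + SVD idea mentioned for V2.1. The function name `spectral_signature`, the matrix shapes, and the random test matrices are assumptions for illustration; the actual `prepare_inference_routing()` implementation is not shown in this diff.

```python
import numpy as np


def spectral_signature(B, A):
    """Right singular vectors and singular values of B @ A via thin QR + small SVD.
    B: (d_out, r), A: (r, d_in). Returns (Vt, sigma) with Vt of shape (r, d_in)."""
    _, Rb = np.linalg.qr(B)      # B = Qb @ Rb, Rb is (r, r)
    Qa, Ra = np.linalg.qr(A.T)   # A.T = Qa @ Ra, so A = Ra.T @ Qa.T
    # B @ A = Qb @ (Rb @ Ra.T) @ Qa.T, so only an r x r SVD is needed.
    _, sigma, Vt_small = np.linalg.svd(Rb @ Ra.T)
    return Vt_small @ Qa.T, sigma


rng = np.random.default_rng(0)
d, r = 16, 4
B = rng.normal(size=(d, r))      # stand-in for a trained (non-zero) LoRA B
A = rng.normal(size=(r, d))      # stand-in for the LoRA A
Vt, sigma = spectral_signature(B, A)

# sigma^2-weighted Rayleigh quotient for one hypothetical hidden state h.
h = rng.normal(size=d)
fit = float(np.sum(sigma ** 2 * (Vt @ h) ** 2) / (np.sum(sigma ** 2) * (h @ h)))
print(sigma.round(3), round(fit, 4))
```

The numerator of the fit equals ‖B A h‖², so the score can be cross-checked against the full product without any decomposition.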
+
+ ### Fix 3: Threshold = 0.995 (Match ROOT)
+ - Changed only in the shell script
+ - Reduces null-space consumption per task
+ - Preserves capacity for later tasks
+
+ ### Nothing else is added
+ - No KL replay (violates the zero-replay setting)
+ - No learned routing parameters (would lose the parameter-free novelty)
+ - No changes to the optimizer/lr/scheduler
+ - Guiding principle: "only improve the implementation, do not over-engineer"
+
+ ---
+
+ ## 5. Specific Code Changes
+
+ ### File 1: `t5_specroute.py`
+ 1. Add a `prepare_inference_routing()` method to T5Stack
+ 2. Modify `compute_spectral_routing()`:
+    - Training: A-row fit + `adaptive_training_bias` (computed from α and n_old)
+    - Inference: SVD fit from `_current_task_svd` (precomputed)
+ 3. Add the `adaptive_training_bias` property
+
+ ### File 2: `run_t5.py`
+ 1. Add `target_routing_alpha` to prompt_config
+ 2. Call `model.encoder.prepare_inference_routing()` before inference
+
+ ### File 3: `gen_script_long_order3_t5_small_specroute_v3.sh`
+ 1. `--threshold 0.995`
+ 2. `--transthreshold 0.995`
+ 3. `--target_routing_alpha 0.8`
+ 4. Output dir: `specroute_v3`