SpecRoute V3: adaptive bias, symmetric inference, threshold 0.995, batch size optimization

V3 fixes (3 targeted):
- Adaptive training bias: beta = tau*ln(alpha*n/(1-alpha)), with softmax temperature tau, n old tasks, and target weight alpha = 0.8; targets ~80% routing weight for the current task
- Symmetric inference routing: prepare_inference_routing() computes SVD for current task
- Threshold 0.995 (raised from 0.980): matches ROOT GPM capacity
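
The adaptive bias in the first fix can be checked numerically. The following is a minimal sketch, not the code in `t5_specroute.py`: the function names `adaptive_bias` and `current_task_weight` are hypothetical, and it assumes that all old tasks share one fit score and that the current task's raw fit equals it, which is the simplification the closed form rests on.

```python
import math

def adaptive_bias(n_old, alpha_target=0.8, tau=1.0):
    """beta(n) = tau * ln(alpha_target * n / (1 - alpha_target))."""
    if n_old < 1:
        # Assumption: with no old tasks there are no competitors, so no bias.
        return 0.0
    return tau * math.log(alpha_target * n_old / (1.0 - alpha_target))

def current_task_weight(beta, n_old, tau=1.0, old_fit=0.0):
    """Softmax weight of the current task against n_old equal competitors."""
    cur = math.exp((old_fit + beta) / tau)
    old = n_old * math.exp(old_fit / tau)
    return cur / (cur + old)

# In a 15-task sequence the number of old tasks runs from 1 to 14.
for n in (1, 5, 14):
    beta = adaptive_bias(n)
    print(n, round(beta, 3), round(current_task_weight(beta, n), 3))
```

Under these assumptions the weight stays at the target for every `n`, whereas a constant bias of 1.0 against 14 equal competitors gives `current_task_weight(1.0, 14)` of roughly 0.16.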

Batch size optimization:
- T4_1gpu: BSZ 4->8, GA 8->4 (effective=32, ~2x fewer micro-batches)
- T4_2gpu: BSZ 2->4, GA 8->4 (effective=32)
- A100: BSZ 8->32, GA 4->1 (effective=32)
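
The "effective=32" figures can be reproduced with a short check. This is a sketch under two assumptions that the commit message does not state explicitly: the effective batch size is per-device batch size times GPU count times gradient-accumulation steps, and `t4_2gpu` means exactly two GPUs. The names `effective_batch_size` and `CONFIGS` are illustrative only.

```python
def effective_batch_size(per_device_bsz, grad_accum, num_gpus=1):
    """Samples contributing to one optimizer step."""
    return per_device_bsz * num_gpus * grad_accum

# (per-device batch size, gradient accumulation steps, GPU count)
CONFIGS = {
    "t4_1gpu": {"old": (4, 8, 1), "new": (8, 4, 1)},
    "t4_2gpu": {"old": (2, 8, 2), "new": (4, 4, 2)},
    "a100": {"old": (8, 4, 1), "new": (32, 1, 1)},
}

for mode, cfg in CONFIGS.items():
    old = effective_batch_size(*cfg["old"])
    new = effective_batch_size(*cfg["new"])
    print(f"{mode}: effective {old} -> {new}, "
          f"micro-batches per step {cfg['old'][1]} -> {cfg['new'][1]}")
```

Every configuration keeps the effective batch size unchanged, so only the number of forward/backward passes per optimizer step differs.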

Docs: V2 diagnosis report, V3 methodology in SPECROUTE_IDEA.md, experiment_versions.md

Files changed (7)

.gitignore +1 -1
improve_gainlora/SPECROUTE_IDEA.md +20 -9
improve_gainlora/T5_small/gen_script_long_order3_t5_small_specroute_v3.sh +893 -0
improve_gainlora/src/run_t5.py +13 -0
improve_gainlora/src/t5_specroute.py +100 -29
results/experiment_versions.md +103 -10
results/specroute_v2_diagnosis.md +174 -0

.gitignore CHANGED

@@ -14,4 +14,4 @@ logs/*
 logs
 *.log
 human_working_IdeaMethod_and_discuss/
+running_H100.txt

improve_gainlora/SPECROUTE_IDEA.md CHANGED

@@ -185,17 +185,27 @@ $$B_t,\, A_t \;\xrightarrow[\text{QR + SVD}]{O(dr^2)}\; (V_t,\, \boldsymbol{\sig
 ### C2 — Spectral Affinity Routing
-**Inference** (all tasks available):
+**Inference** (all tasks available, symmetric SVD-based routing):
 $$w(h) = \mathrm{softmax}\!\left(\frac{[\alpha_1(h),\; \ldots,\; \alpha_T(h)]}{\tau}\right)$$
+All tasks (including the most recently trained) use the same $\sigma^2$-weighted spectral affinity formula (Definition 2). After training task $t$, we compute $\mathcal{S}_t$ once via `prepare_inference_routing()` and use it alongside old tasks' signatures.
 **Training** (task $t$, final SVD unknown because $B_t$ still training):
-$$\alpha_t^{\mathrm{train}}(h) = \frac{\|A_t\, h\|^2}{r\,\|h\|^2} + \beta$$
+$$\alpha_t^{\mathrm{train}}(h) = \frac{\|A_t\, h\|^2}{r\,\|h\|^2} + \beta(n), \qquad \beta(n) = \tau \cdot \ln\!\left(\frac{\alpha_{\mathrm{target}} \cdot n}{1 - \alpha_{\mathrm{target}}}\right)$$
+where $n = |\{\text{old tasks}\}|$ and $\alpha_{\mathrm{target}} \in (0,1)$ is the desired routing weight for the current task (default 0.8).
 **Justification of the A-row proxy:** For any full-rank $B_t$, the column span of $V_t$ (from SVD of $B_t A_t$) equals $\mathrm{range}(A_t^\top)$. So the A rows span the *same* input subspace that the converged $V_t$ will capture. The proxy measures input alignment with this subspace using uniform weighting (no $\sigma$ available yet).
-**Justification of $\beta$:** A rows (kaiming-initialised, unit-variance) produce systematically lower fits than $\sigma^2$-weighted SVD fits of trained old experts. Setting $\beta = 1.0$ makes the softmax produce $w_t > 0.95$, approximating the oracle assignment $w_t = 1$ (principled: during training on task $t$'s data, the optimal routing *is* $w_t = 1$) while allowing marginal knowledge transfer from relevant old experts.
+**Justification of adaptive $\beta(n)$:** A constant bias $\beta_0$ causes the current task's softmax routing weight to decay as $O(1/n)$ with task count (softmax dilution). The adaptive formula normalises this: solving $w_t = \alpha_{\mathrm{target}}$ in the softmax equation yields the closed-form above. This ensures the current task receives routing weight $\approx \alpha_{\mathrm{target}}$ regardless of $n$, providing consistent gradient flow throughout the CL sequence.
+*Derivation.* Let $f = \alpha_t^{\mathrm{train}}$, $g = \bar{\alpha}_{\mathrm{old}}$ (mean old task fit). The softmax weight for the current task among $n+1$ competitors:
+$$w_t = \frac{e^{f/\tau}}{e^{f/\tau} + n\, e^{g/\tau}} = \alpha_{\mathrm{target}} \;\;\Longrightarrow\;\; \beta = f - g = \tau \cdot \ln\!\left(\frac{\alpha_{\mathrm{target}} \cdot n}{1 - \alpha_{\mathrm{target}}}\right)$$
+**Justification of symmetric inference:** During training, the A-row proxy + adaptive bias is necessary because $B_t$ is evolving (cold-start). At inference, $B_t$ is frozen and $\Delta W_t = B_t A_t$ has well-defined SVD. Using the same $\sigma^2$-weighted Rayleigh quotient for all tasks ensures *measurement symmetry* — all affinities live on the same metric space, and the Routing–Protection Duality Theorem (Theorem 1) applies uniformly.
 ### C3 — Capacity-Aware Subspace Allocation
@@ -240,8 +250,9 @@ where $\varepsilon_0$ is the base threshold. This allocates incrementally strict
 | Theory | Implementation | File |
 |--------|---------------|------|
 | Spectral signature $\mathcal{S}_t$ | `compute_spectral_signatures()` (thin QR+SVD) | `t5_specroute.py` |
-| Spectral affinity $\alpha_t(h)$ (old tasks) | σ²-weighted Rayleigh quotient | `compute_spectral_routing()` |
-| A-row proxy $\alpha_t^{\mathrm{train}}$ (current) | `(proj**2).sum() / (r * h_norm_sq) + training_bias` | `compute_spectral_routing()` |
+| Spectral affinity $\alpha_t(h)$ (all tasks at inference) | σ²-weighted Rayleigh quotient | `compute_spectral_routing()` |
+| A-row proxy $\alpha_t^{\mathrm{train}}$ (current, training only) | A-row fit + adaptive bias $\beta(n)$ | `compute_spectral_routing()` |
+| Symmetric inference SVD | `prepare_inference_routing()` → SVD of current $B_t A_t$ | `t5_specroute.py` |
 | Routing $w = \mathrm{softmax}(\alpha / \tau)$ | `torch.softmax(fit_scores / temp)` | `compute_spectral_routing()` |
 | Drift-free input $h$ | `inputs_embeds = self.embed_tokens(input_ids)` → mean-pool | `T5Stack.forward()` |
 | GPM + InfLoRA null-space | `get_reg_matrix()` | `cl_trainer_specroute.py` |
@@ -262,8 +273,8 @@ where $\varepsilon_0$ is the base threshold. This allocates incrementally strict
 ### Task $t \geq 2$
 1. Load model + fresh LoRA; load old LoRA weights and spectral signatures.
 2. InfLoRA: project current $A_t$ into null-space of old GPM bases.
-3. Train `lora_B` with spectral affinity routing + training bias $\beta$.
-4. Post-training: compute $\mathcal{S}_t$ + update GPM bases.
+3. Train `lora_B` with spectral affinity routing + adaptive bias $\beta(n)$.
+4. Post-training: compute $\mathcal{S}_t$ (`prepare_inference_routing` for inference, `compute_spectral_signatures` for storage) + update GPM bases.
 5. Save all artifacts for next task.
 ---
@@ -276,8 +287,8 @@ where $\varepsilon_0$ is the base threshold. This allocates incrementally strict
 | Benchmarks | SuperNI (15 tasks, 2 orderings), Long (15 tasks, 2 orderings) |
 | Metrics | AP (Average Performance, ↑), FT (Forgetting, ↓) |
 | LoRA | $r = 4$, $\alpha = 32$, dropout 0.0 |
-| Routing | $\tau = 1.0$, $\beta = 1.0$ (train only) |
-| ESA | $\varepsilon_0 = 0.980$ (dynamic) |
+| Routing | $\tau = 1.0$, $\alpha_{\mathrm{target}} = 0.8$, adaptive $\beta(n)$ (train); symmetric SVD (inference) |
+| ESA | $\varepsilon_0 = 0.995$ (dynamic) |
 | Precision | fp32 + gradient checkpointing |
 | Comparison | Batch size, LR, scheduler match ROOT (GainLoRA) exactly |
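
The A-row proxy used during training, $\|A_t h\|^2 / (r\,\|h\|^2) + \beta$, can be illustrated in plain Python. This is a sketch of the formula only, not the batched tensor code in `compute_spectral_routing()`: the function name `a_row_proxy`, the list-of-rows layout for $A_t$, and the handling of a zero input are all assumptions made for illustration.

```python
def a_row_proxy(A, h, beta=0.0):
    """||A h||^2 / (r * ||h||^2) + beta, with A given as r rows of length d."""
    r = len(A)
    h_norm_sq = sum(x * x for x in h)
    if h_norm_sq == 0.0:
        # Assumption: a zero input carries no alignment signal.
        return beta
    # Project h onto each row of A.
    proj = [sum(a * x for a, x in zip(row, h)) for row in A]
    return sum(p * p for p in proj) / (r * h_norm_sq) + beta

# Two orthonormal rows (r = 2) in a 3-dimensional input space.
A = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

print(a_row_proxy(A, [3.0, 4.0, 0.0]))  # h lies in the row space of A
print(a_row_proxy(A, [0.0, 0.0, 5.0]))  # h is orthogonal to every row of A
```

The score depends only on how much of `h` falls inside the span of the A rows, which is why it can stand in for the SVD-based affinity while $B_t$ is still being trained.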

improve_gainlora/T5_small/gen_script_long_order3_t5_small_specroute_v3.sh ADDED

	@@ -0,0 +1,893 @@

+#!/bin/bash
+#SBATCH -J cl
+#SBATCH -o cl-%j.out
+#SBATCH -p compute
+#SBATCH -N 1
+#SBATCH -t 20:00:00
+#SBATCH --mem 128G
+#SBATCH --gres=gpu:2
+export CUDA_DEVICE_ORDER="PCI_BUS_ID"
+port=$(shuf -i25000-30000 -n1)
+# ============================================================
+# Auto-detect GPU count and type for optimal parallelism
+# ============================================================
+NUM_GPUS=$(nvidia-smi -L 2>/dev/null | wc -l)
+GPU_MEM=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits 2>/dev/null | head -1)
+if [ -z "$GPU_MEM" ]; then
+    echo "ERROR: No GPU detected!"
+    exit 1
+fi
+# Determine GPU type
+if [ "$GPU_MEM" -lt 20000 ]; then
+    IS_T4=1
+    echo "[GPU] Detected T4 GPUs (${GPU_MEM}MB VRAM each)"
+else
+    IS_T4=0
+    echo "[GPU] Detected high-memory GPUs (${GPU_MEM}MB VRAM each)"
+fi
+# Determine parallelism strategy
+if [ "$IS_T4" -eq 1 ] && [ "$NUM_GPUS" -ge 2 ]; then
+    GPU_MODE="t4_2gpu"
+    GPU_IDS="0,1"
+    FP16_FLAG=""
+    echo "[GPU] Strategy: 2x T4 DataParallel + fp32 + gradient_checkpointing"
+elif [ "$IS_T4" -eq 1 ]; then
+    GPU_MODE="t4_1gpu"
+    GPU_IDS="${1:-0}"
+    FP16_FLAG=""
+    echo "[GPU] Strategy: 1x T4 + fp32 + gradient_checkpointing"
+else
+    GPU_MODE="a100"
+    GPU_IDS="${1:-0}"
+    FP16_FLAG=""
+    echo "[GPU] Strategy: A100 (single GPU, fp32)"
+fi
+echo "[GPU] Using CUDA_VISIBLE_DEVICES=$GPU_IDS"
+echo "============================================================"
+echo ""
+if [ "$GPU_MODE" = "t4_2gpu" ]; then
+    BSZ=4; GA=4; EVAL_BSZ=128
+elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+    BSZ=8; GA=4; EVAL_BSZ=128
+else
+    BSZ=32; GA=1; EVAL_BSZ=128
+fi
+CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+   --do_train \
+   --do_predict \
+   --predict_with_generate \
+   --model_name_or_path $2 \
+   --data_dir CL_Benchmark \
+   --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+   --task_config_dir configs/gen_script_long_order3_t5_configs/yelp \
+   --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp \
+   --per_device_train_batch_size $BSZ \
+   --per_device_eval_batch_size $EVAL_BSZ \
+   --gradient_accumulation_steps $GA \
+   --learning_rate 0.0003 \
+   --num_train_epochs 10 \
+   --run_name gen_script_long_order3_t5_small_specroute_v3 \
+   --max_source_length 512 \
+   --max_target_length 50 \
+   --generation_max_length 50 \
+   --add_task_name False \
+   --add_dataset_name False \
+   --overwrite_output_dir \
+   --overwrite_cache \
+   --lr_scheduler_type constant \
+   --warmup_steps 0 \
+   --logging_strategy steps \
+   --logging_steps 10 \
+   --metric_for_best_model eval_exact_match \
+   --evaluation_strategy epoch \
+   --save_strategy epoch \
+   --save_total_limit 1 \
+   --load_best_model_at_end \
+   --lora_r 8 \
+   --lora_alpha 32 \
+   --lora_dropout 0.0 \
+   --data_replay_freq -1 \
+   --mlp_hidden_dim 100 \
+   --model_name specroute \
+   --target_routing_alpha 0.8 \
+   --gen_data_dir CL_Benchmark \
+   --threshold 0.995 \
+   --transthreshold 0.995 \
+   $FP16_FLAG
+rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/checkpoint*
+sleep 5
+if [ "$GPU_MODE" = "t4_2gpu" ]; then
+    BSZ=4; GA=4; EVAL_BSZ=128
+elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+    BSZ=8; GA=4; EVAL_BSZ=128
+else
+    BSZ=32; GA=1; EVAL_BSZ=128
+fi
+CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+   --do_train \
+   --do_predict \
+   --predict_with_generate \
+   --model_name_or_path $2 \
+   --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights \
+   --data_dir CL_Benchmark \
+   --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+   --task_config_dir configs/gen_script_long_order3_t5_configs/amazon \
+   --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon \
+   --per_device_train_batch_size $BSZ \
+   --per_device_eval_batch_size $EVAL_BSZ \
+   --gradient_accumulation_steps $GA \
+   --learning_rate 0.0003 \
+   --num_train_epochs 10 \
+   --run_name gen_script_long_order3_t5_small_specroute_v3 \
+   --max_source_length 512 \
+   --max_target_length 50 \
+   --generation_max_length 50 \
+   --add_task_name False \
+   --add_dataset_name False \
+   --overwrite_output_dir \
+   --overwrite_cache \
+   --lr_scheduler_type constant \
+   --warmup_steps 0 \
+   --logging_strategy steps \
+   --logging_steps 10 \
+   --metric_for_best_model eval_exact_match_for_amazon \
+   --evaluation_strategy epoch \
+   --save_strategy epoch \
+   --save_total_limit 1 \
+   --load_best_model_at_end \
+   --lora_r 8 \
+   --lora_alpha 32 \
+   --lora_dropout 0.0 \
+   --data_replay_freq -1 \
+   --mlp_hidden_dim 100 \
+   --model_name specroute \
+   --target_routing_alpha 0.8 \
+   --gen_data_dir CL_Benchmark \
+   --threshold 0.995 \
+   --transthreshold 0.995 \
+   $FP16_FLAG
+rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/checkpoint*
+sleep 5
+if [ "$GPU_MODE" = "t4_2gpu" ]; then
+    BSZ=4; GA=4; EVAL_BSZ=128
+elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+    BSZ=8; GA=4; EVAL_BSZ=128
+else
+    BSZ=32; GA=1; EVAL_BSZ=128
+fi
+CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+   --do_train \
+   --do_predict \
+   --predict_with_generate \
+   --model_name_or_path $2 \
+   --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights \
+   --data_dir CL_Benchmark \
+   --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+   --task_config_dir configs/gen_script_long_order3_t5_configs/mnli \
+   --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli \
+   --per_device_train_batch_size $BSZ \
+   --per_device_eval_batch_size $EVAL_BSZ \
+   --gradient_accumulation_steps $GA \
+   --learning_rate 0.0003 \
+   --num_train_epochs 10 \
+   --run_name gen_script_long_order3_t5_small_specroute_v3 \
+   --max_source_length 512 \
+   --max_target_length 50 \
+   --generation_max_length 50 \
+   --add_task_name False \
+   --add_dataset_name False \
+   --overwrite_output_dir \
+   --overwrite_cache \
+   --lr_scheduler_type constant \
+   --warmup_steps 0 \
+   --logging_strategy steps \
+   --logging_steps 10 \
+   --metric_for_best_model eval_exact_match_for_mnli \
+   --evaluation_strategy epoch \
+   --save_strategy epoch \
+   --save_total_limit 1 \
+   --load_best_model_at_end \
+   --lora_r 8 \
+   --lora_alpha 32 \
+   --lora_dropout 0.0 \
+   --data_replay_freq -1 \
+   --mlp_hidden_dim 100 \
+   --model_name specroute \
+   --target_routing_alpha 0.8 \
+   --gen_data_dir CL_Benchmark \
+   --threshold 0.995 \
+   --transthreshold 0.995 \
+   $FP16_FLAG
+rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/checkpoint*
+sleep 5
+if [ "$GPU_MODE" = "t4_2gpu" ]; then
+    BSZ=4; GA=4; EVAL_BSZ=128
+elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+    BSZ=8; GA=4; EVAL_BSZ=128
+else
+    BSZ=32; GA=1; EVAL_BSZ=128
+fi
+CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+   --do_train \
+   --do_predict \
+   --predict_with_generate \
+   --model_name_or_path $2 \
+   --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights \
+   --data_dir CL_Benchmark \
+   --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+   --task_config_dir configs/gen_script_long_order3_t5_configs/cb \
+   --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb \
+   --per_device_train_batch_size $BSZ \
+   --per_device_eval_batch_size $EVAL_BSZ \
+   --gradient_accumulation_steps $GA \
+   --learning_rate 0.0003 \
+   --num_train_epochs 10 \
+   --run_name gen_script_long_order3_t5_small_specroute_v3 \
+   --max_source_length 512 \
+   --max_target_length 50 \
+   --generation_max_length 50 \
+   --add_task_name False \
+   --add_dataset_name False \
+   --overwrite_output_dir \
+   --overwrite_cache \
+   --lr_scheduler_type constant \
+   --warmup_steps 0 \
+   --logging_strategy steps \
+   --logging_steps 10 \
+   --metric_for_best_model eval_exact_match_for_cb \
+   --evaluation_strategy epoch \
+   --save_strategy epoch \
+   --save_total_limit 1 \
+   --load_best_model_at_end \
+   --lora_r 8 \
+   --lora_alpha 32 \
+   --lora_dropout 0.0 \
+   --data_replay_freq -1 \
+   --mlp_hidden_dim 100 \
+   --model_name specroute \
+   --target_routing_alpha 0.8 \
+   --gen_data_dir CL_Benchmark \
+   --threshold 0.995 \
+   --transthreshold 0.995 \
+   $FP16_FLAG
+rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/checkpoint*
+sleep 5
+if [ "$GPU_MODE" = "t4_2gpu" ]; then
+    BSZ=4; GA=4; EVAL_BSZ=128
+elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+    BSZ=8; GA=4; EVAL_BSZ=128
+else
+    BSZ=32; GA=1; EVAL_BSZ=128
+fi
+CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+   --do_train \
+   --do_predict \
+   --predict_with_generate \
+   --model_name_or_path $2 \
+   --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights \
+   --data_dir CL_Benchmark \
+   --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+   --task_config_dir configs/gen_script_long_order3_t5_configs/copa \
+   --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa \
+   --per_device_train_batch_size $BSZ \
+   --per_device_eval_batch_size $EVAL_BSZ \
+   --gradient_accumulation_steps $GA \
+   --learning_rate 0.0003 \
+   --num_train_epochs 10 \
+   --run_name gen_script_long_order3_t5_small_specroute_v3 \
+   --max_source_length 512 \
+   --max_target_length 50 \
+   --generation_max_length 50 \
+   --add_task_name False \
+   --add_dataset_name False \
+   --overwrite_output_dir \
+   --overwrite_cache \
+   --lr_scheduler_type constant \
+   --warmup_steps 0 \
+   --logging_strategy steps \
+   --logging_steps 10 \
+   --metric_for_best_model eval_exact_match_for_copa \
+   --evaluation_strategy epoch \
+   --save_strategy epoch \
+   --save_total_limit 1 \
+   --load_best_model_at_end \
+   --lora_r 8 \
+   --lora_alpha 32 \
+   --lora_dropout 0.0 \
+   --data_replay_freq -1 \
+   --mlp_hidden_dim 100 \
+   --model_name specroute \
+   --target_routing_alpha 0.8 \
+   --gen_data_dir CL_Benchmark \
+   --threshold 0.995 \
+   --transthreshold 0.995 \
+   $FP16_FLAG
+rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/checkpoint*
+sleep 5
+if [ "$GPU_MODE" = "t4_2gpu" ]; then
+    BSZ=4; GA=4; EVAL_BSZ=128
+elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+    BSZ=8; GA=4; EVAL_BSZ=128
+else
+    BSZ=32; GA=1; EVAL_BSZ=128
+fi
+CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+   --do_train \
+   --do_predict \
+   --predict_with_generate \
+   --model_name_or_path $2 \
+   --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights \
+   --data_dir CL_Benchmark \
+   --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+   --task_config_dir configs/gen_script_long_order3_t5_configs/qqp \
+   --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp \
+   --per_device_train_batch_size $BSZ \
+   --per_device_eval_batch_size $EVAL_BSZ \
+   --gradient_accumulation_steps $GA \
+   --learning_rate 0.0003 \
+   --num_train_epochs 10 \
+   --run_name gen_script_long_order3_t5_small_specroute_v3 \
+   --max_source_length 512 \
+   --max_target_length 50 \
+   --generation_max_length 50 \
+   --add_task_name False \
+   --add_dataset_name False \
+   --overwrite_output_dir \
+   --overwrite_cache \
+   --lr_scheduler_type constant \
+   --warmup_steps 0 \
+   --logging_strategy steps \
+   --logging_steps 10 \
+   --metric_for_best_model eval_exact_match_for_qqp \
+   --evaluation_strategy epoch \
+   --save_strategy epoch \
+   --save_total_limit 1 \
+   --load_best_model_at_end \
+   --lora_r 8 \
+   --lora_alpha 32 \
+   --lora_dropout 0.0 \
+   --data_replay_freq -1 \
+   --mlp_hidden_dim 100 \
+   --model_name specroute \
+   --target_routing_alpha 0.8 \
+   --gen_data_dir CL_Benchmark \
+   --threshold 0.995 \
+   --transthreshold 0.995 \
+   $FP16_FLAG
+rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/checkpoint*
+sleep 5
+if [ "$GPU_MODE" = "t4_2gpu" ]; then
+    BSZ=4; GA=4; EVAL_BSZ=128
+elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+    BSZ=8; GA=4; EVAL_BSZ=128
+else
+    BSZ=32; GA=1; EVAL_BSZ=128
+fi
+CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+   --do_train \
+   --do_predict \
+   --predict_with_generate \
+   --model_name_or_path $2 \
+   --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/saved_weights \
+   --data_dir CL_Benchmark \
+   --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+   --task_config_dir configs/gen_script_long_order3_t5_configs/rte \
+   --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte \
+   --per_device_train_batch_size $BSZ \
+   --per_device_eval_batch_size $EVAL_BSZ \
+   --gradient_accumulation_steps $GA \
+   --learning_rate 0.0003 \
+   --num_train_epochs 10 \
+   --run_name gen_script_long_order3_t5_small_specroute_v3 \
+   --max_source_length 512 \
+   --max_target_length 50 \
+   --generation_max_length 50 \
+   --add_task_name False \
+   --add_dataset_name False \
+   --overwrite_output_dir \
+   --overwrite_cache \
+   --lr_scheduler_type constant \
+   --warmup_steps 0 \
+   --logging_strategy steps \
+   --logging_steps 10 \
+   --metric_for_best_model eval_exact_match_for_rte \
+   --evaluation_strategy epoch \
+   --save_strategy epoch \
+   --save_total_limit 1 \
+   --load_best_model_at_end \
+   --lora_r 8 \
+   --lora_alpha 32 \
+   --lora_dropout 0.0 \
+   --data_replay_freq -1 \
+   --mlp_hidden_dim 100 \
+   --model_name specroute \
+   --target_routing_alpha 0.8 \
+   --gen_data_dir CL_Benchmark \
+   --threshold 0.995 \
+   --transthreshold 0.995 \
+   $FP16_FLAG
+rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte/checkpoint*
+sleep 5
+if [ "$GPU_MODE" = "t4_2gpu" ]; then
+    BSZ=4; GA=4; EVAL_BSZ=128
+elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+    BSZ=8; GA=4; EVAL_BSZ=128
+else
+    BSZ=32; GA=1; EVAL_BSZ=128
+fi
+CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+   --do_train \
+   --do_predict \
+   --predict_with_generate \
+   --model_name_or_path $2 \
+   --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte/saved_weights \
+   --data_dir CL_Benchmark \
+   --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+   --task_config_dir configs/gen_script_long_order3_t5_configs/imdb \
+   --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/8-imdb \
+   --per_device_train_batch_size $BSZ \
+   --per_device_eval_batch_size $EVAL_BSZ \
+   --gradient_accumulation_steps $GA \
+   --learning_rate 0.0003 \
+   --num_train_epochs 10 \
+   --run_name gen_script_long_order3_t5_small_specroute_v3 \
+   --max_source_length 512 \
+   --max_target_length 50 \
+   --generation_max_length 50 \
+   --add_task_name False \
+   --add_dataset_name False \
+   --overwrite_output_dir \
+   --overwrite_cache \
+   --lr_scheduler_type constant \
+   --warmup_steps 0 \
+   --logging_strategy steps \
+   --logging_steps 10 \
+   --metric_for_best_model eval_exact_match_for_imdb \
+   --evaluation_strategy epoch \
+   --save_strategy epoch \
+   --save_total_limit 1 \
+   --load_best_model_at_end \
+   --lora_r 8 \
+   --lora_alpha 32 \
+   --lora_dropout 0.0 \
+   --data_replay_freq -1 \
+   --mlp_hidden_dim 100 \
+   --model_name specroute \
+   --target_routing_alpha 0.8 \
+   --gen_data_dir CL_Benchmark \
+   --threshold 0.995 \
+   --transthreshold 0.995 \
+   $FP16_FLAG
+rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/8-imdb/checkpoint*
+sleep 5
+if [ "$GPU_MODE" = "t4_2gpu" ]; then
+    BSZ=4; GA=4; EVAL_BSZ=128
+elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+    BSZ=8; GA=4; EVAL_BSZ=128
+else
+    BSZ=32; GA=1; EVAL_BSZ=128
+fi
+CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+   --do_train \
+   --do_predict \
+   --predict_with_generate \
+   --model_name_or_path $2 \
+   --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/8-imdb/saved_weights \
+   --data_dir CL_Benchmark \
+   --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+   --task_config_dir configs/gen_script_long_order3_t5_configs/sst2 \
+   --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/9-sst2 \
+   --per_device_train_batch_size $BSZ \
+   --per_device_eval_batch_size $EVAL_BSZ \
+   --gradient_accumulation_steps $GA \
+   --learning_rate 0.0003 \
+   --num_train_epochs 10 \
+   --run_name gen_script_long_order3_t5_small_specroute_v3 \
+   --max_source_length 512 \
+   --max_target_length 50 \
+   --generation_max_length 50 \
+   --add_task_name False \
+   --add_dataset_name False \
+   --overwrite_output_dir \
+   --overwrite_cache \
+   --lr_scheduler_type constant \
+   --warmup_steps 0 \
+   --logging_strategy steps \
+   --logging_steps 10 \
+   --metric_for_best_model eval_exact_match_for_sst2 \
+   --evaluation_strategy epoch \
+   --save_strategy epoch \
+   --save_total_limit 1 \
+   --load_best_model_at_end \
+   --lora_r 8 \
+   --lora_alpha 32 \
+   --lora_dropout 0.0 \
+   --data_replay_freq -1 \
+   --mlp_hidden_dim 100 \
+   --model_name specroute \
+   --target_routing_alpha 0.8 \
+   --gen_data_dir CL_Benchmark \
+   --threshold 0.995 \
+   --transthreshold 0.995 \
+   $FP16_FLAG
+rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/9-sst2/checkpoint*
+sleep 5
+if [ "$GPU_MODE" = "t4_2gpu" ]; then
+    BSZ=4; GA=4; EVAL_BSZ=128
+elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+    BSZ=8; GA=4; EVAL_BSZ=128
+else
+    BSZ=32; GA=1; EVAL_BSZ=128
+fi
+CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+   --do_train \
+   --do_predict \
+   --predict_with_generate \
+   --model_name_or_path $2 \
+   --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/8-imdb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/9-sst2/saved_weights \
+   --data_dir CL_Benchmark \
+   --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+   --task_config_dir configs/gen_script_long_order3_t5_configs/dbpedia \
+   --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/10-dbpedia \
+   --per_device_train_batch_size $BSZ \
+   --per_device_eval_batch_size $EVAL_BSZ \
+   --gradient_accumulation_steps $GA \
+   --learning_rate 0.0003 \
+   --num_train_epochs 10 \
+   --run_name gen_script_long_order3_t5_small_specroute_v3 \
+   --max_source_length 512 \
+   --max_target_length 50 \
+   --generation_max_length 50 \
+   --add_task_name False \
+   --add_dataset_name False \
+   --overwrite_output_dir \
+   --overwrite_cache \
+   --lr_scheduler_type constant \
+   --warmup_steps 0 \
+   --logging_strategy steps \
+   --logging_steps 10 \
+   --metric_for_best_model eval_exact_match_for_dbpedia \
+   --evaluation_strategy epoch \
+   --save_strategy epoch \
+   --save_total_limit 1 \
+   --load_best_model_at_end \
+   --lora_r 8 \
+   --lora_alpha 32 \
+   --lora_dropout 0.0 \
+   --data_replay_freq -1 \
+   --mlp_hidden_dim 100 \
+   --model_name specroute \
+   --target_routing_alpha 0.8 \
+   --gen_data_dir CL_Benchmark \
+   --threshold 0.995 \
+   --transthreshold 0.995 \
+   $FP16_FLAG
+rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/10-dbpedia/checkpoint*
+sleep 5
+if [ "$GPU_MODE" = "t4_2gpu" ]; then
+    BSZ=4; GA=4; EVAL_BSZ=128
+elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+    BSZ=8; GA=4; EVAL_BSZ=128
+else
+    BSZ=32; GA=1; EVAL_BSZ=128
+fi
+CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+   --do_train \
+   --do_predict \
+   --predict_with_generate \
+   --model_name_or_path $2 \
+   --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/8-imdb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/9-sst2/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/10-dbpedia/saved_weights \
+   --data_dir CL_Benchmark \
+   --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+   --task_config_dir configs/gen_script_long_order3_t5_configs/agnews \
+   --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/11-agnews \
+   --per_device_train_batch_size $BSZ \
+   --per_device_eval_batch_size $EVAL_BSZ \
+   --gradient_accumulation_steps $GA \
+   --learning_rate 0.0003 \
+   --num_train_epochs 10 \
+   --run_name gen_script_long_order3_t5_small_specroute_v3 \
+   --max_source_length 512 \
+   --max_target_length 50 \
+   --generation_max_length 50 \
+   --add_task_name False \
+   --add_dataset_name False \
+   --overwrite_output_dir \
+   --overwrite_cache \
+   --lr_scheduler_type constant \
+   --warmup_steps 0 \
+   --logging_strategy steps \
+   --logging_steps 10 \
+   --metric_for_best_model eval_exact_match_for_agnews \
+   --evaluation_strategy epoch \
+   --save_strategy epoch \
+   --save_total_limit 1 \
+   --load_best_model_at_end \
+   --lora_r 8 \
+   --lora_alpha 32 \
+   --lora_dropout 0.0 \
+   --data_replay_freq -1 \
+   --mlp_hidden_dim 100 \
+   --model_name specroute \
+   --target_routing_alpha 0.8 \
+   --gen_data_dir CL_Benchmark \
+   --threshold 0.995 \
+   --transthreshold 0.995 \
+   $FP16_FLAG
+rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/11-agnews/checkpoint*
+sleep 5
+if [ "$GPU_MODE" = "t4_2gpu" ]; then
+    BSZ=4; GA=4; EVAL_BSZ=128
+elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+    BSZ=8; GA=4; EVAL_BSZ=128
+else
+    BSZ=32; GA=1; EVAL_BSZ=128
+fi
+CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+   --do_train \
+   --do_predict \
+   --predict_with_generate \
+   --model_name_or_path $2 \
+   --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/8-imdb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/9-sst2/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/10-dbpedia/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/11-agnews/saved_weights \
+   --data_dir CL_Benchmark \
+   --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+   --task_config_dir configs/gen_script_long_order3_t5_configs/yahoo \
+   --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/12-yahoo \
+   --per_device_train_batch_size $BSZ \
+   --per_device_eval_batch_size $EVAL_BSZ \
+   --gradient_accumulation_steps $GA \
+   --learning_rate 0.0003 \
+   --num_train_epochs 10 \
+   --run_name gen_script_long_order3_t5_small_specroute_v3 \
+   --max_source_length 512 \
+   --max_target_length 50 \
+   --generation_max_length 50 \
+   --add_task_name False \
+   --add_dataset_name False \
+   --overwrite_output_dir \
+   --overwrite_cache \
+   --lr_scheduler_type constant \
+   --warmup_steps 0 \
+   --logging_strategy steps \
+   --logging_steps 10 \
+   --metric_for_best_model eval_exact_match_for_yahoo \
+   --evaluation_strategy epoch \
+   --save_strategy epoch \
+   --save_total_limit 1 \
+   --load_best_model_at_end \
+   --lora_r 8 \
+   --lora_alpha 32 \
+   --lora_dropout 0.0 \
+   --data_replay_freq -1 \
+   --mlp_hidden_dim 100 \
+   --model_name specroute \
+   --target_routing_alpha 0.8 \
+   --gen_data_dir CL_Benchmark \
+   --threshold 0.995 \
+   --transthreshold 0.995 \
+   $FP16_FLAG
+rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/12-yahoo/checkpoint*
+sleep 5
+if [ "$GPU_MODE" = "t4_2gpu" ]; then
+    BSZ=4; GA=4; EVAL_BSZ=128
+elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+    BSZ=8; GA=4; EVAL_BSZ=128
+else
+    BSZ=32; GA=1; EVAL_BSZ=128
+fi
+CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+   --do_train \
+   --do_predict \
+   --predict_with_generate \
+   --model_name_or_path $2 \
+   --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/8-imdb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/9-sst2/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/10-dbpedia/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/11-agnews/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/12-yahoo/saved_weights \
+   --data_dir CL_Benchmark \
+   --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+   --task_config_dir configs/gen_script_long_order3_t5_configs/multirc \
+   --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/13-multirc \
+   --per_device_train_batch_size $BSZ \
+   --per_device_eval_batch_size $EVAL_BSZ \
+   --gradient_accumulation_steps $GA \
+   --learning_rate 0.0003 \
+   --num_train_epochs 10 \
+   --run_name gen_script_long_order3_t5_small_specroute_v3 \
+   --max_source_length 512 \
+   --max_target_length 50 \
+   --generation_max_length 50 \
+   --add_task_name False \
+   --add_dataset_name False \
+   --overwrite_output_dir \
+   --overwrite_cache \
+   --lr_scheduler_type constant \
+   --warmup_steps 0 \
+   --logging_strategy steps \
+   --logging_steps 10 \
+   --metric_for_best_model eval_exact_match_for_multirc \
+   --evaluation_strategy epoch \
+   --save_strategy epoch \
+   --save_total_limit 1 \
+   --load_best_model_at_end \
+   --lora_r 8 \
+   --lora_alpha 32 \
+   --lora_dropout 0.0 \
+   --data_replay_freq -1 \
+   --mlp_hidden_dim 100 \
+   --model_name specroute \
+   --target_routing_alpha 0.8 \
+   --gen_data_dir CL_Benchmark \
+   --threshold 0.995 \
+   --transthreshold 0.995 \
+   $FP16_FLAG
+rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/13-multirc/checkpoint*
+sleep 5
+if [ "$GPU_MODE" = "t4_2gpu" ]; then
+    BSZ=4; GA=4; EVAL_BSZ=128
+elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+    BSZ=8; GA=4; EVAL_BSZ=128
+else
+    BSZ=32; GA=1; EVAL_BSZ=128
+fi
+CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+   --do_train \
+   --do_predict \
+   --predict_with_generate \
+   --model_name_or_path $2 \
+   --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/8-imdb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/9-sst2/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/10-dbpedia/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/11-agnews/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/12-yahoo/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/13-multirc/saved_weights \
+   --data_dir CL_Benchmark \
+   --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+   --task_config_dir configs/gen_script_long_order3_t5_configs/boolq \
+   --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/14-boolq \
+   --per_device_train_batch_size $BSZ \
+   --per_device_eval_batch_size $EVAL_BSZ \
+   --gradient_accumulation_steps $GA \
+   --learning_rate 0.0003 \
+   --num_train_epochs 10 \
+   --run_name gen_script_long_order3_t5_small_specroute_v3 \
+   --max_source_length 512 \
+   --max_target_length 50 \
+   --generation_max_length 50 \
+   --add_task_name False \
+   --add_dataset_name False \
+   --overwrite_output_dir \
+   --overwrite_cache \
+   --lr_scheduler_type constant \
+   --warmup_steps 0 \
+   --logging_strategy steps \
+   --logging_steps 10 \
+   --metric_for_best_model eval_exact_match_for_boolq \
+   --evaluation_strategy epoch \
+   --save_strategy epoch \
+   --save_total_limit 1 \
+   --load_best_model_at_end \
+   --lora_r 8 \
+   --lora_alpha 32 \
+   --lora_dropout 0.0 \
+   --data_replay_freq -1 \
+   --mlp_hidden_dim 100 \
+   --model_name specroute \
+   --target_routing_alpha 0.8 \
+   --gen_data_dir CL_Benchmark \
+   --threshold 0.995 \
+   --transthreshold 0.995 \
+   $FP16_FLAG
+rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/14-boolq/checkpoint*
+sleep 5
+if [ "$GPU_MODE" = "t4_2gpu" ]; then
+    BSZ=4; GA=4; EVAL_BSZ=128
+elif [ "$GPU_MODE" = "t4_1gpu" ]; then
+    BSZ=8; GA=4; EVAL_BSZ=128
+else
+    BSZ=32; GA=1; EVAL_BSZ=128
+fi
+CUDA_VISIBLE_DEVICES=$GPU_IDS python src/run_t5.py \
+   --do_train \
+   --do_predict \
+   --predict_with_generate \
+   --model_name_or_path $2 \
+   --previous_lora_path logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/1-yelp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/2-amazon/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/3-mnli/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/4-cb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/5-copa/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/6-qqp/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/7-rte/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/8-imdb/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/9-sst2/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/10-dbpedia/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/11-agnews/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/12-yahoo/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/13-multirc/saved_weights,logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/14-boolq/saved_weights \
+   --data_dir CL_Benchmark \
+   --task_order yelp,amazon,mnli,cb,copa,qqp,rte,imdb,sst2,dbpedia,agnews,yahoo,multirc,boolq,wic \
+   --task_config_dir configs/gen_script_long_order3_t5_configs/wic \
+   --output_dir logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/15-wic \
+   --per_device_train_batch_size $BSZ \
+   --per_device_eval_batch_size $EVAL_BSZ \
+   --gradient_accumulation_steps $GA \
+   --learning_rate 0.0003 \
+   --num_train_epochs 10 \
+   --run_name gen_script_long_order3_t5_small_specroute_v3 \
+   --max_source_length 512 \
+   --max_target_length 50 \
+   --generation_max_length 50 \
+   --add_task_name False \
+   --add_dataset_name False \
+   --overwrite_output_dir \
+   --overwrite_cache \
+   --lr_scheduler_type constant \
+   --warmup_steps 0 \
+   --logging_strategy steps \
+   --logging_steps 10 \
+   --metric_for_best_model eval_exact_match_for_wic \
+   --evaluation_strategy epoch \
+   --save_strategy epoch \
+   --save_total_limit 1 \
+   --load_best_model_at_end \
+   --lora_r 8 \
+   --lora_alpha 32 \
+   --lora_dropout 0.0 \
+   --data_replay_freq -1 \
+   --mlp_hidden_dim 100 \
+   --model_name specroute \
+   --target_routing_alpha 0.8 \
+   --gen_data_dir CL_Benchmark \
+   --threshold 0.995 \
+   --transthreshold 0.995 \
+   $FP16_FLAG
+rm -rf logs_and_outputs/gen_script_long_order3_t5_small_specroute_v3/outputs/15-wic/checkpoint*
+sleep 5
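The `GPU_MODE` branches repeated before each task are intended to hold the effective batch size at 32. A minimal check of that arithmetic follows, assuming effective batch = per-device batch × gradient-accumulation steps × number of GPUs. The GPU counts (2 for `t4_2gpu`, 1 otherwise) and all names in the snippet are assumptions for illustration; the script does not state them.

```python
# Per-mode values copied from the script's if/elif/else branches.
# n_gpus is an assumption of this sketch, not a value set by the script.
MODES = {
    "t4_2gpu": {"bsz": 4, "ga": 4, "n_gpus": 2},
    "t4_1gpu": {"bsz": 8, "ga": 4, "n_gpus": 1},
    "default": {"bsz": 32, "ga": 1, "n_gpus": 1},  # fallback branch
}


def effective_batch_size(bsz, ga, n_gpus):
    """Samples contributing to one optimizer step."""
    return bsz * ga * n_gpus


for mode, cfg in MODES.items():
    print(mode, effective_batch_size(**cfg))
```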

improve_gainlora/src/run_t5.py CHANGED

@@ -164,6 +164,14 @@ class ModelArguments:
         },
     )
     run_single: bool = field(
         default=False,
         metadata={
@@ -489,6 +497,7 @@ def main():
         'lora_alpha': model_args.lora_alpha,
         'lora_dropout': model_args.lora_dropout,
         'training_bias': model_args.training_bias,
     }
     if training_args.model_name in ['inflora', 'olora']:
@@ -1021,6 +1030,10 @@ def main():
         trainer.model.encoder.is_inference = True
         # Collect attention weights for GainLoRA KL replay (not needed for SpecRoute)
         if training_args.model_name != 'specroute':
             _ = trainer.predict(

         },
     )
+    target_routing_alpha: Optional[float] = field(
+        default=0.8,
+        metadata={
+            "help": "Target softmax routing weight for current task during training (SpecRoute V3). "
+                    "Adaptive bias = T*ln(alpha*n_old/(1-alpha)). Set 0 to use fixed training_bias."
+        },
+    )
     run_single: bool = field(
         default=False,
         metadata={
         'lora_alpha': model_args.lora_alpha,
         'lora_dropout': model_args.lora_dropout,
         'training_bias': model_args.training_bias,
+        'target_routing_alpha': model_args.target_routing_alpha,
     }
     if training_args.model_name in ['inflora', 'olora']:
         trainer.model.encoder.is_inference = True
+        # SpecRoute V3: precompute SVD of current task's LoRA for symmetric inference routing
+        if training_args.model_name == 'specroute' and hasattr(trainer.model.encoder, 'prepare_inference_routing'):
+            trainer.model.encoder.prepare_inference_routing()
         # Collect attention weights for GainLoRA KL replay (not needed for SpecRoute)
         if training_args.model_name != 'specroute':
             _ = trainer.predict(

improve_gainlora/src/t5_specroute.py CHANGED

@@ -183,10 +183,16 @@ class T5Stack(T5PreTrainedModel):
             # Spectral signatures loaded from previous tasks' saved weights
             self.spectral_signatures = []  # List[dict] — one dict per old task
             self.routing_temperature = prompt_config.get('attn_temperature', 1.0)
-            # Training bias: ensures current task gets adequate routing weight
-            # during training (compensates for A-based fit being lower than
-            # SVD-based fits of established old tasks). Set to 0 at inference.
-            self.training_bias = prompt_config.get('training_bias', 1.0)
             # For inference logging
             self.all_attn_weights = []
@@ -210,8 +216,13 @@ class T5Stack(T5PreTrainedModel):
         """
         Compute routing weights using spectral projection fits.
-        For each task, measures how much of the input's energy falls into that
-        task's operating subspace (defined by SVD of its LoRA weights).
         Args:
             avg_inputs_embeds: (B, 1, d_model) — averaged input token embeddings
@@ -224,29 +235,64 @@ class T5Stack(T5PreTrainedModel):
         fits = []
-        # 1. Current task fit: use A rows directly (not SVD of B@A).
-        # Motivation: At training start B=0, so SVD(B@A) gives σ≈0 → fit≈0 →
-        # routing weight≈0 → gradient≈0 → B can't learn (cold-start problem).
-        # A rows define the input subspace available to the current task
-        # (post null-space projection). This gives a stable, non-zero signal
-        # from initialization, measuring "how much of h's energy falls into
-        # the current task's available input subspace".
-        # Formula: fit_cur(h) = Σ_i (a_i · h)² / (r · ||h||²)
-        current_fits_layers = []
-        for block in self.block:
-            attn = block.layer[0].SelfAttention
-            for lora in [attn.lora_q, attn.lora_v]:
-                A = lora.lora_A.data.float()  # (r, d_model) — frozen
-                r = lora.r
-                A_h = A.to(h.device, dtype=h.dtype)
-                proj = torch.matmul(h, A_h.T)  # (B, 1, r)
-                fit = (proj ** 2).sum(dim=-1) / (r * h_norm_sq)  # (B, 1)
-                current_fits_layers.append(fit)
-        current_fit = torch.stack(current_fits_layers).mean(dim=0)  # (B, 1)
-        # Training bias: boost current task during training (β>0 during train, 0 at inference)
-        if self.training and hasattr(self, 'training_bias'):
-            current_fit = current_fit + self.training_bias
-        fits.append(current_fit)
         # 2. Previous tasks fit: use spectral signatures (V, sigma)
         for sig_dict in self.spectral_signatures:
@@ -278,6 +324,31 @@ class T5Stack(T5PreTrainedModel):
         return weights.unsqueeze(2)  # (B, n_tasks, 1)
     def parallelize(self, device_map=None):
         self.device_map = (
             get_device_map(len(self.block), range(torch.cuda.device_count()))

             # Spectral signatures loaded from previous tasks' saved weights
             self.spectral_signatures = []  # List[dict] — one dict per old task
             self.routing_temperature = prompt_config.get('attn_temperature', 1.0)
+            # Adaptive training bias: β = T·ln(α·n_old/(1−α))
+            # Ensures current task gets consistent routing weight ~α regardless
+            # of total number of tasks (fixes softmax dilution with constant bias).
+            # At inference, uses symmetric SVD-based routing instead (no bias needed).
+            self._target_routing_alpha = prompt_config.get('target_routing_alpha', 0.8)
+            # Precomputed SVD of current task's LoRA for symmetric inference routing.
+            # Populated by prepare_inference_routing() before prediction.
+            self._current_task_svd = None
             # For inference logging
             self.all_attn_weights = []
         """
         Compute routing weights using spectral projection fits.
+        Training mode: A-row fit + adaptive bias for current task (cold-start safe).
+        Inference mode: SVD-based fit for ALL tasks (symmetric, no bias needed).
+        The adaptive bias β = T·ln(α·n_old/(1−α)) ensures the current task gets
+        routing weight ≈ α regardless of the number of previous tasks, compensating
+        for softmax dilution. At inference, all tasks use the same σ²-weighted
+        Rayleigh quotient (SVD fit), eliminating the train-test asymmetry.
         Args:
             avg_inputs_embeds: (B, 1, d_model) — averaged input token embeddings
         fits = []
+        if self.training:
+            # === TRAINING MODE: A-row fit + adaptive bias for current task ===
+            # A rows give stable non-zero signal even when B=0 (cold-start).
+            current_fits_layers = []
+            for block in self.block:
+                attn = block.layer[0].SelfAttention
+                for lora in [attn.lora_q, attn.lora_v]:
+                    A = lora.lora_A.data.float()  # (r, d_model) — frozen
+                    r = lora.r
+                    A_h = A.to(h.device, dtype=h.dtype)
+                    proj = torch.matmul(h, A_h.T)  # (B, 1, r)
+                    fit = (proj ** 2).sum(dim=-1) / (r * h_norm_sq)  # (B, 1)
+                    current_fits_layers.append(fit)
+            current_fit = torch.stack(current_fits_layers).mean(dim=0)  # (B, 1)
+            # Adaptive training bias: β = T·ln(α·n_old/(1−α))
+            n_old = len(self.spectral_signatures)
+            if n_old > 0:
+                alpha = self._target_routing_alpha
+                beta = self.routing_temperature * math.log(alpha * n_old / (1.0 - alpha))
+                current_fit = current_fit + beta
+            fits.append(current_fit)
+        else:
+            # === INFERENCE MODE: SVD-based fit for current task (symmetric) ===
+            # After training, B≠0 so SVD(B@A) gives meaningful signatures.
+            # Using the same σ²-weighted formula as old tasks eliminates the
+            # A-row vs SVD measurement asymmetry.
+            if self._current_task_svd is not None:
+                cur_fits = []
+                for key, sig_data in self._current_task_svd.items():
+                    if not key.startswith('enc.'):
+                        continue
+                    V = sig_data['V'].to(h.device, dtype=h.dtype)
+                    sigma = sig_data['sigma'].to(h.device, dtype=h.dtype)
+                    proj = torch.matmul(h, V.T)
+                    sigma_sq = sigma ** 2
+                    sigma_sq_sum = sigma_sq.sum() + 1e-8
+                    weighted_proj = (proj ** 2 * sigma_sq.unsqueeze(0).unsqueeze(0)).sum(dim=-1)
+                    fit = weighted_proj / (sigma_sq_sum * h_norm_sq)
+                    cur_fits.append(fit)
+                if cur_fits:
+                    current_fit = torch.stack(cur_fits).mean(dim=0)
+                else:
+                    current_fit = torch.zeros(h.shape[0], 1, device=h.device, dtype=h.dtype)
+            else:
+                # Fallback: A-row fit without bias (backward compat)
+                current_fits_layers = []
+                for block in self.block:
+                    attn = block.layer[0].SelfAttention
+                    for lora in [attn.lora_q, attn.lora_v]:
+                        A = lora.lora_A.data.float()
+                        r = lora.r
+                        A_h = A.to(h.device, dtype=h.dtype)
+                        proj = torch.matmul(h, A_h.T)
+                        fit = (proj ** 2).sum(dim=-1) / (r * h_norm_sq)
+                        current_fits_layers.append(fit)
+                current_fit = torch.stack(current_fits_layers).mean(dim=0)
+            fits.append(current_fit)
         # 2. Previous tasks fit: use spectral signatures (V, sigma)
         for sig_dict in self.spectral_signatures:
         return weights.unsqueeze(2)  # (B, n_tasks, 1)
+    def prepare_inference_routing(self):
+        """
+        Precompute SVD of current task's LoRA for symmetric inference routing.
+        Must be called after training (B≠0) and before prediction. Computes
+        spectral signatures from the current (most recently trained) task's
+        LoRA weights, enabling the same σ²-weighted Rayleigh quotient formula
+        for all tasks during inference.
+        """
+        sigs = {}
+        with torch.no_grad():
+            for j, block in enumerate(self.block):
+                attn = block.layer[0].SelfAttention
+                for name, lora in [('q', attn.lora_q), ('v', attn.lora_v)]:
+                    A = lora.lora_A.data.float()
+                    B = lora.lora_B.data.float()
+                    r = lora.r
+                    S, Vt = _thin_svd_low_rank(B, A)
+                    sigs[f'enc.{j}.self.{name}'] = {
+                        'V': Vt[:r].cpu(),
+                        'sigma': S[:r].cpu()
+                    }
+        self._current_task_svd = sigs
+        print(f"[SpecRoute] Prepared inference routing: {len(sigs)} encoder SVD signatures for current task")
     def parallelize(self, device_map=None):
         self.device_map = (
             get_device_map(len(self.block), range(torch.cuda.device_count()))

results/experiment_versions.md CHANGED

@@ -166,19 +166,110 @@ ROOT GainLoRA solves this problem thanks to the trans_input MLP, which maps input m
 - threshold/transthreshold: 0.980 (kept from previous)
 ### Results
-> *Not yet run; experiment needed*
 ### Expectations
-- Cold-start fix → later tasks (8+) receive a large enough routing weight (>30%) → B can learn
-- Threshold 0.980 → sentiment tasks (imdb, sst2) have null-space capacity for learning
-- Training bias β=1.0 → the current task dominates routing during training, with no effect on inference
-- Fair BSZ (effective=32 = ROOT) → AP can be compared directly
-- Overall AP: expected >50 (V1=39.74); the goal is to approach ROOT=59.70
-### If the results fall short → V3 Plan
-- **V3a**: Adaptive training bias β per task (higher for later tasks, whose null space is smaller)
-- **V3b**: Adaptive threshold per layer (layers near the output need a lower threshold)
-- **V3c**: Warm-start: initialize B_new from a weighted combination of old B vectors (zero-replay compliant)
 ---
@@ -238,3 +329,5 @@ V2 has replay disabled (`data_replay_freq=-1`), matching ROOT. Estimated runtime n
 | 2025-XX-XX | V2.0 (cancelled) | ~~Replay~~ | ~~Add experience replay~~: **CANCELLED** for violating the zero-replay constraint |
 | 2025-XX-XX | V2.0 | Bug fix + Fair | A-row routing (fix cold-start), training bias β=1.0, threshold 0.980, fair BSZ=32 |
 | 2025-XX-XX | V2.1 | Perf Optimization | Thin QR+SVD (~186× speedup per SVD, zero accuracy loss) |

 - threshold/transthreshold: 0.980 (kept from previous)
 ### Results
+| # | Task | ROOT EM | V2 EM | Δ | Notes |
+|---|------|---------|-------|---|---------|
+| 1 | yelp | 56.01 | 35.91 | -20.10 | Below |
+| 2 | amazon | 52.05 | 36.58 | -15.47 | Below |
+| 3 | mnli | 34.07 | 0.25 | -33.82 | Catastrophic forgetting (peak 31.25 ep8) |
+| 4 | cb | 3.57 | 0.00 | -3.57 | EM=0 — misrouted garbage output |
+| 5 | copa | 42.00 | **47.00** | **+5.00** | ✅ Better |
+| 6 | qqp | 76.96 | **77.03** | **+0.07** | ✅ Tie |
+| 7 | rte | 45.85 | 0.36 | -45.49 | Catastrophic forgetting (peak 51.26 ep4) |
+| 8 | imdb | 89.51 | 0.00 | -89.51 | ❌ EM=0 — pred "positive"/"negative" vs label "Good"/"Bad" |
+| 9 | sst2 | 85.21 | 0.00 | -85.21 | ❌ EM=0 — pred "negative" vs label "Bad" |
+| 10 | dbpedia | 98.16 | 71.95 | -26.21 | Below |
+| 11 | agnews | 88.37 | 68.21 | -20.16 | Below |
+| 12 | yahoo | 57.28 | 6.82 | -50.46 | Very low |
+| 13 | multirc | 50.52 | **55.42** | **+4.90** | ✅ Better |
+| 14 | boolq | 60.43 | **61.44** | **+1.01** | ✅ Better |
+| 15 | wic | 55.49 | 0.00 | -55.49 | ❌ EM=0 — pred "the same meaning" vs label "True" |
+| | **AP(EM)** | **59.70** | **30.73** | **-28.97** | |
+| | **AP(rougeL)** | **61.66** | **38.00** | **-23.66** | |
+### Detailed analysis
+**Group 1: EM=0 due to MISROUTING (4 tasks)**
+- imdb predicts "positive"/"negative" → this is the label vocabulary of yelp/amazon → routing sends imdb inputs to old LoRAs
+- sst2 predicts "negative" → same pattern, routed to the yelp LoRA
+- wic predicts "the same meaning"/"different" → the correct labels are "True"/"False" → routed to the wrong expert
+- cb predicts gibberish ("bedroom", "virtuous") → completely misrouted
+**Group 2: Catastrophic forgetting (2 tasks)**
+- mnli: EM peak=31.25 at ep8, but final=0.25 → degenerate (always "neutral")
+- rte: EM peak=51.26 at ep4, final=0.36 → overfits, then collapses
+**ROOT CAUSE: Constant β=1.0 does not scale with the number of tasks**
+| n_tasks | Training w_cur | Inference w_cur | Gap |
+|---------|---------------|----------------|-----|
+| 1 | 100% | 100% | 1.0x |
+| 2 | 71.5% | 48.0% | 1.5x |
+| 8 (imdb) | **26.4%** | **11.7%** | 2.3x |
+| 15 (wic) | **15.2%** | **6.2%** | 2.5x |
+Task 8 (imdb) receives only 26.4% routing weight during training → 73.6% of the gradient flows through old LoRAs → the model learns the label vocabulary of old tasks instead of the current task.
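The dilution in the table can be reproduced approximately with a toy softmax model. The sketch below assumes τ = 1, that all old tasks share one fit value, and that the current task's A-row fit sits 0.08 below it. The 0.08 gap is a value chosen here so the numbers land close to the table; it is not taken from the training logs.

```python
import math


def current_task_weight(n_old, bias, fit_gap=0.08, tau=1.0):
    """Softmax weight of the current task against n_old equally fitted old tasks.

    Fits are measured relative to the old tasks' fit, so each old task
    contributes exp(0) = 1. fit_gap is a hypothetical value (see lead-in).
    """
    if n_old == 0:
        return 1.0
    cur = math.exp((bias - fit_gap) / tau)
    return cur / (cur + n_old)


for n_tasks in (1, 2, 8, 15):
    n_old = n_tasks - 1
    train = current_task_weight(n_old, bias=1.0)  # constant beta = 1.0
    infer = current_task_weight(n_old, bias=0.0)  # no bias at inference
    print(f"n_tasks={n_tasks:2d}  train={train:6.1%}  infer={infer:6.1%}")
```

Because the bias is constant while the denominator grows with `n_old`, the current task's share shrinks roughly as 1/n in this model.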
+**SECONDARY CAUSE: A-row fit vs SVD fit asymmetry**
+- Training: the current task uses the A-row fit (uniform weighting)
+- Inference: the current task STILL uses the A-row fit (without bias), while old tasks use the SVD fit (σ²-weighted)
+- The SVD fit is systematically higher than the A-row fit → old tasks always win routing at inference
+### Expectations
+- Cold-start fix → resolves EM=0 on tasks 1-3 ✅
+- Training bias β=1.0 → sufficient only for ≤3 tasks, NOT sufficient for 8+ tasks ❌
+---
+## Version 3.0 — SpecRoute V3: Adaptive Bias + Symmetric Inference Routing
+### Methodology changes (SPECROUTE_IDEA.md UPDATED)
+**1. Adaptive Training Bias (replaces constant β=1.0):**
+$$\beta(n) = \tau \cdot \ln\!\left(\frac{\alpha_{\mathrm{target}} \cdot n}{1 - \alpha_{\mathrm{target}}}\right)$$
+- $n$ = number of old tasks = `len(spectral_signatures)`
+- $\alpha_{\mathrm{target}}$ = target routing weight (default 0.8)
+- Ensures w_cur ≈ 80% regardless of the total number of tasks (exactly $\alpha_{\mathrm{target}}$ when all tasks have equal fits)
+- Derived by solving the softmax equation: see SPECROUTE_IDEA.md Section C2
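A short numeric check of the formula above, under the simplifying assumption that every task has the same raw fit, so only the bias separates the current task from the n old ones. Function names are illustrative. The range check on α is an addition in this sketch, since the logarithm is undefined at α = 0 and the ratio at α = 1.

```python
import math


def adaptive_bias(n_old, alpha=0.8, tau=1.0):
    """beta(n) = tau * ln(alpha * n / (1 - alpha)); needs n_old >= 1, 0 < alpha < 1."""
    if n_old < 1 or not 0.0 < alpha < 1.0:
        raise ValueError("need n_old >= 1 and 0 < alpha < 1")
    return tau * math.log(alpha * n_old / (1.0 - alpha))


def current_task_weight(n_old, beta, tau=1.0):
    """Softmax weight of the current task when all raw fits are equal."""
    cur = math.exp(beta / tau)
    return cur / (cur + n_old)


for n_old in (1, 7, 14):
    beta = adaptive_bias(n_old)
    weight = current_task_weight(n_old, beta)
    print(n_old, round(beta, 3), round(weight, 3))
```

With unequal fits the realised weight shifts by the fit gap, so α acts as a target rather than a guarantee.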
+**2. Symmetric Inference Routing (replaces the A-row fit at inference):**
+- After training, B≠0 → SVD(B@A) yields meaningful signatures
+- Call `prepare_inference_routing()` before prediction
+- Inference: ALL tasks (including the current one) use the same σ²-weighted Rayleigh quotient
+- Completely removes the A-row vs SVD asymmetry → measurement symmetry
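The two fit measures can be compared on a toy example. This sketch mirrors the formulas used in `t5_specroute.py` but with plain Python lists instead of torch tensors. The vectors and singular values are made up, and the orthonormal rows of `V` stand in for the A rows, which need not be orthonormal in practice.

```python
def a_row_fit(h, A):
    """Uniformly weighted fit: sum_i (a_i . h)^2 / (r * ||h||^2)."""
    h_norm_sq = sum(x * x for x in h)
    proj_sq = [sum(a * x for a, x in zip(row, h)) ** 2 for row in A]
    return sum(proj_sq) / (len(A) * h_norm_sq)


def svd_fit(h, V, sigma, eps=1e-8):
    """sigma^2-weighted fit: sum_i s_i^2 (v_i . h)^2 / (sum_i s_i^2 * ||h||^2)."""
    h_norm_sq = sum(x * x for x in h)
    sigma_sq = [s * s for s in sigma]
    weighted = sum(
        s2 * sum(v * x for v, x in zip(row, h)) ** 2
        for s2, row in zip(sigma_sq, V)
    )
    return weighted / ((sum(sigma_sq) + eps) * h_norm_sq)


# Toy subspace: two orthonormal directions in R^3 with singular values 3 and 1.
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
sigma = [3.0, 1.0]

for h in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    print(h, round(a_row_fit(h, V), 3), round(svd_fit(h, V, sigma), 3))
```

The uniform fit cannot tell the two in-subspace directions apart, while the σ²-weighted fit favours the dominant one. Scoring every task with the same formula avoids comparing these two scales against each other.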
+**3. Threshold 0.995 (matches ROOT, instead of 0.980):**
+- Preserves null-space capacity for later tasks
+- Capacity: d/(r·(1-ε)) = 512/(8·0.005) = 12,800 tasks (far more than needed)
+### Code Changes
+**`t5_specroute.py`:**
+- `compute_spectral_routing()`:
+  - Training: A-row fit + β(n), computed automatically from len(spectral_signatures)
+  - Inference: uses `_current_task_svd` (SVD-based fit) for the current task
+- Add `prepare_inference_routing()`: computes SVD(B@A) for the current task's LoRA
+- Add the `_target_routing_alpha` config parameter
+- Remove the fixed `training_bias`
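`prepare_inference_routing()` calls `_thin_svd_low_rank(B, A)`, whose body is not shown in this diff. One common way to get the singular values and right singular vectors of a rank-r product without forming the full matrix is a QR factorisation of each factor followed by an SVD of the r×r core. The sketch below uses NumPy instead of torch and illustrates that general technique under those assumptions; it is not the repository's implementation.

```python
import numpy as np


def thin_svd_low_rank(B, A):
    """Singular values and right singular vectors of B @ A.

    B: (d_out, r), A: (r, d_in). Returns (S, Vt) with S of shape (r,) and
    Vt of shape (r, d_in), without materialising the (d_out, d_in) product.
    """
    Qb, Rb = np.linalg.qr(B)      # B   = Qb @ Rb, Rb is (r, r)
    Qa, Ra = np.linalg.qr(A.T)    # A.T = Qa @ Ra, Ra is (r, r)
    core = Rb @ Ra.T              # B @ A = Qb @ core @ Qa.T
    _, S, Wt = np.linalg.svd(core)
    return S, Wt @ Qa.T           # rotate the core's right vectors back to d_in


rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 8
B = rng.standard_normal((d_out, r))
A = rng.standard_normal((r, d_in))

S, Vt = thin_svd_low_rank(B, A)
S_full = np.linalg.svd(B @ A, compute_uv=False)[:r]
print(np.allclose(S, S_full), np.allclose(Vt @ Vt.T, np.eye(r)))
```

The two QR steps cost O(d·r²) and the core SVD O(r³), which is where the saving over a full SVD of the d_out × d_in product comes from when r is small.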
+**`run_t5.py`:**
+- Add the `target_routing_alpha` argument (default 0.8)
+- Call `model.encoder.prepare_inference_routing()` before inference
+**Shell script: `T5_small/gen_script_long_order3_t5_small_specroute_v3.sh`:**
+- `--target_routing_alpha 0.8` (replaces `--training_bias 1.0`)
+- `--threshold 0.995` (replaces 0.980)
+- `--transthreshold 0.995` (replaces 0.980)
 ### Expectations
+- Adaptive bias → tasks 8+ receive ≈80% routing weight → they can learn the correct label vocabulary
+- Symmetric inference → more accurate routing at eval → EM>0 for imdb/sst2/wic
+- Threshold 0.995 → better protection + larger routing margin (Theorem 1)
+### Results
+> *Not yet run; experiment needed*
 ---
 | 2025-XX-XX | V2.0 (cancelled) | ~~Replay~~ | ~~Add experience replay~~: **CANCELLED** for violating the zero-replay constraint |
 | 2025-XX-XX | V2.0 | Bug fix + Fair | A-row routing (fix cold-start), training bias β=1.0, threshold 0.980, fair BSZ=32 |
 | 2025-XX-XX | V2.1 | Perf Optimization | Thin QR+SVD (~186× speedup per SVD, zero accuracy loss) |
+| 2026-03-17 | V2.0 | **Results** | AP(EM)=30.73 vs ROOT=59.70. 4 tasks EM=0 (imdb/sst2/wic/cb misrouting), 2 catastrophic forgetting |
+| 2026-03-17 | V3.0 | **Methodology** | Adaptive bias β(n)=τ·ln(α·n/(1-α)), symmetric SVD inference routing, threshold→0.995 |

results/specroute_v2_diagnosis.md ADDED Viewed

	@@ -0,0 +1,174 @@

+# SpecRoute V2 → V3: Comprehensive Diagnosis & Remediation Plan
+## 1. V2 Results Overview
+| Metric | SpecRoute V2 | ROOT (GainLoRA-InfLoRA) | Gap |
+|--------|-------------|------------------------|-----|
+| AP(EM) | 30.73 | 59.70 | **-28.97** |
+| AP(rougeL) | 38.00 | 61.66 | **-23.66** |
+### Detailed failure breakdown
+**Group 1: EM = 0 throughout training (LABEL FORMAT MISMATCH)**
+| Task | Pos | Prediction | Ground Truth | Cause |
+|------|-----|-----------|-------------|-------------|
+| imdb | 8 | "positive"/"negative" | "Good"/"Bad" | Misrouted → yelp/amazon LoRA |
+| sst2 | 9 | "negative" | "Bad" | Misrouted → yelp/amazon LoRA |
+| wic | 15 | "the same meaning"/"different" | "True"/"False" | Misrouted → unknown LoRA |
+| cb | 4 | "bedroom"/"yes"/"virtuous"/gibberish | "entailment"/"neutral"/"contradiction" | Misrouted → unrelated LoRA |
+**Group 2: Catastrophic Forgetting**
+| Task | Best EM | Final EM | Cause |
+|------|---------|----------|-------------|
+| mnli | 31.25 (ep8) | 0.25 | Degenerate → always "neutral" |
+| rte | 51.26 (ep4) | 0.36 | Degenerate → always "entailment" |
+**Group 3: Working normally**
+| Task | SpecRoute | ROOT | Status |
+|------|-----------|------|--------|
+| copa | **47.00** | 42.00 | ✅ Better (+5.0) |
+| multirc | **55.42** | 50.52 | ✅ Better (+4.9) |
+| boolq | **61.44** | 60.43 | ✅ Better (+1.0) |
+| qqp | **77.03** | 76.96 | ✅ Tie |
+---
+## 2. Root Cause Analysis
+### 2.1 PRIMARY BUG: Constant Training Bias Does Not Scale with the Number of Tasks
+**Current formula:**
+$$\text{fit}_{\text{cur}} = \frac{1}{L}\sum_\ell \frac{\sum_i (a_i \cdot h)^2}{r \|h\|^2} + \beta \quad (\beta = 1.0 \text{ fixed})$$
+**Softmax routing weight for the current task:**
+$$w_{\text{cur}} = \frac{e^{(\text{fit}_{\text{cur}})/T}}{e^{(\text{fit}_{\text{cur}})/T} + (n-1) \cdot e^{\text{fit}_{\text{old}}/T}}$$
+With $\beta = 1.0$, $T = 1.0$, fit_raw ≈ 0.12, fit_old ≈ 0.20 (training uses fit_raw + $\beta$ for the current task; inference uses fit_raw alone):
+| n_tasks | Training $w_{\text{cur}}$ | Inference $w_{\text{cur}}$ |
+|---------|--------------------------|---------------------------|
+| 1 | 100% | 100% |
+| 2 | 71.5% | 48.0% |
+| 5 | 38.5% | 18.8% |
+| **8 (imdb)** | **26.4%** | **11.7%** |
+| 10 | 21.8% | 9.3% |
+| **15 (wic)** | **15.2%** | **6.2%** |
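The table can be reproduced from the softmax formula with a few lines of Python. This is a minimal sketch, assuming every old task shares the single fit value quoted in the text; the function name `current_task_weight` is illustrative and does not come from `t5_specroute.py`.

```python
import math


def current_task_weight(n_tasks, fit_cur, fit_old, temperature=1.0):
    """Softmax weight of the current task against n_tasks - 1 old tasks
    that are all assumed to have the same fit value."""
    cur = math.exp(fit_cur / temperature)
    old = (n_tasks - 1) * math.exp(fit_old / temperature)
    return cur / (cur + old)


# Values quoted in the text: raw A-row fit, old-task SVD fit, constant bias.
FIT_RAW, FIT_OLD, BETA = 0.12, 0.20, 1.0

for n in (1, 2, 5, 8, 10, 15):
    train = current_task_weight(n, FIT_RAW + BETA, FIT_OLD)  # bias applied
    infer = current_task_weight(n, FIT_RAW, FIT_OLD)         # no bias
    print(f"n={n:2d}  train={train:6.1%}  infer={infer:6.1%}")
```

With a fixed bias the numerator stays constant while the denominator grows linearly in the number of old tasks, which is why the current task's share keeps shrinking.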
+**Consequences:**
+- Task 8 (imdb): only 26.4% routing weight during training → 73.6% of the gradient signal flows through old LoRAs → the model cannot learn the labels "Good"/"Bad"
+- Task 15 (wic): only 15.2% routing weight → it learns almost nothing
+**Comparison with ROOT:** ROOT uses an independent sigmoid for each task: $w_k = |2\sigma(4\cos(x, \text{key}_k)) - 1|$. There is no zero-sum competition → each task can reach a weight of ~0.8 regardless of the number of tasks.
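To make the contrast concrete, here is a small sketch of the gate formula quoted above. It is built only from that formula, not from ROOT's code; `sigmoid_gate`, `cosine`, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def sigmoid_gate(x, key):
    """w_k = |2 * sigmoid(4 * cos(x, key_k)) - 1|, computed per task."""
    c = cosine(x, key)
    return abs(2.0 / (1.0 + math.exp(-4.0 * c)) - 1.0)


x = [1.0, 0.5, 0.0]
keys = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

weights_two = [sigmoid_gate(x, k) for k in keys]
# Adding a third key leaves the first two weights untouched: no normalisation
# across tasks takes place, unlike softmax.
weights_three = [sigmoid_gate(x, k) for k in keys + [[0.0, 1.0, 0.0]]]
print(weights_two, weights_three[:2])
```

Because each weight depends on one key only, the weights do not have to sum to 1, so a new task does not dilute the others.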
+### 2.2 SECOND BUG: A-row fit vs SVD fit Asymmetry (Train-Test Gap)
+**Training:** $\text{fit}_{\text{cur}} = \text{A-row fit} + \beta$
+**Inference:** $\text{fit}_{\text{cur}} = \text{A-row fit}$ (no bias)
+The two formulas measure fit on **two different scales**:
+- **A-row fit** (current task): $\frac{\sum_i (a_i \cdot h)^2}{r \|h\|^2}$, uniform weighting
+- **SVD fit** (old tasks): $\frac{\sum_i \sigma_i^2 (v_i \cdot h)^2}{\sum_i \sigma_i^2 \|h\|^2}$, $\sigma^2$-weighted
+After null-space projection, the A rows are constrained to a narrow subspace → the A-row fit is **systematically lower** than the SVD fit → old tasks always win routing at inference.
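The scale mismatch can be seen on a toy example. The sketch below implements the two formulas above in plain Python and evaluates them on the same subspace and the same input; the function names, the 4-dimensional vectors, and the singular values are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def a_row_fit(A, h):
    """Uniformly weighted fit: sum_i (a_i . h)^2 / (r * ||h||^2)."""
    r = len(A)
    return sum(dot(a, h) ** 2 for a in A) / (r * dot(h, h))


def svd_fit(V, sigma, h):
    """sigma^2-weighted fit: sum_i s_i^2 (v_i . h)^2 / (sum_i s_i^2 * ||h||^2)."""
    num = sum(s * s * dot(v, h) ** 2 for v, s in zip(V, sigma))
    return num / (sum(s * s for s in sigma) * dot(h, h))


# Same rank-2 subspace (first two coordinate axes) and same input for both.
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
h = [3.0, 0.0, 0.0, 0.0]   # lies entirely along the first basis direction
sigma = [10.0, 1.0]        # hypothetical singular values of a trained expert

print(a_row_fit(basis, h))       # uniform weighting divides by r
print(svd_fit(basis, sigma, h))  # sigma^2 weighting favours the dominant direction
```

Even though the input lies fully inside the subspace in both cases, the uniform score is capped by the `1/r` normalisation here, while the weighted score approaches 1.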
+### 2.3 THIRD BUG: Threshold Too Low (0.980 vs ROOT 0.995)
+- threshold = 0.980 → each task consumes **more** of the null-space
+- After 7 tasks: the null-space left for task 8 is very narrow
+- The rows of A_8 are projected into a small null-space → the A-row fit becomes extremely low
+- ROOT uses 0.995 → each task consumes less null-space → capacity is preserved for later tasks
+---
+## 3. Theoretical Analysis (Theory-Backed)
+### 3.1 Softmax Competition Bias (Information Theory)
+Softmax with a constant bias violates the **principle of maximum entropy** as the number of tasks grows. With $n$ tasks and similar fits, softmax converges to the uniform distribution $1/n$ (maximum entropy). A constant bias $\beta$ is not enough to counteract this entropy when $n$ is large.
+**Adaptive bias derivation:** We want the current task to reach a fixed weight $\alpha$, assuming all $n$ tasks (the current one plus $n-1$ old ones) share the same raw fit $f$:
+$$\alpha = \frac{e^{(f + \beta)/T}}{e^{(f + \beta)/T} + (n-1)e^{f/T}}$$
+Since $\alpha/(1-\alpha) = e^{\beta/T}/(n-1)$, solving for $\beta$ gives:
+$$\boxed{\beta = T \cdot \ln\left(\frac{\alpha(n-1)}{1-\alpha}\right)}$$
+With $\alpha = 0.8$, $T = 1.0$:
+- n=2: $\beta$ = 1.39 → w = 80%
+- n=8: $\beta$ = 3.33 → w = 80%
+- n=15: $\beta$ = 4.03 → w = 80%
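The closed form can be checked numerically. The following is a minimal sketch, assuming equal raw fits as in the derivation; it is parameterised by the number of old tasks (`n_old = n - 1`), and the function names are illustrative rather than taken from the repository.

```python
import math


def adaptive_bias(n_old, alpha=0.8, temperature=1.0):
    """beta = T * ln(alpha * n_old / (1 - alpha)); 0 when there are no old tasks.
    alpha must lie strictly between 0 and 1."""
    if n_old < 1:
        return 0.0
    return temperature * math.log(alpha * n_old / (1.0 - alpha))


def current_task_weight(n_old, bias, fit=0.15, temperature=1.0):
    """Softmax weight of the current task when all raw fits equal `fit`."""
    cur = math.exp((fit + bias) / temperature)
    old = n_old * math.exp(fit / temperature)
    return cur / (cur + old)


for n_total in (2, 8, 15):
    n_old = n_total - 1
    beta = adaptive_bias(n_old)
    print(f"n={n_total:2d}  beta={beta:.2f}  w_cur={current_task_weight(n_old, beta):.2f}")
```

The common fit value cancels out of the ratio, so the resulting weight equals `alpha` for any `fit` and any temperature, as long as the fits really are equal; unequal fits shift the weight away from the target.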
+**Paper connection**: Similar to "bias correction" in the Adam optimizer: the bias must change over time to maintain the desired statistical property.
+### 3.2 Rayleigh Quotient Symmetry (Linear Algebra)
+The current fit formula violates **measurement symmetry**: the current task and the old tasks use different metrics. In Grassmannian geometry, the distance between two subspaces must be measured with the same metric.
+**Weighted Rayleigh quotient** (the standard for both):
+$$\text{fit}_k(h) = \frac{\sum_{i=1}^r \sigma_{k,i}^2 (v_{k,i} \cdot h)^2}{\sum_{i=1}^r \sigma_{k,i}^2 \cdot \|h\|^2}$$
+At inference, the current task must also use the SVD-based fit (the SVD is available because B ≠ 0 after training).
+**Paper connection**: Principal angle theory (Björck & Golub, 1973): the distance between subspaces should be measured by canonical angles, equivalent to the $\sigma$-weighted Rayleigh quotient.
+### 3.3 Null-Space Capacity Bound (GPM Theory)
+From GPM (Saha et al., 2021): with threshold $\tau$, each task occupies $\leq r(1-\tau)$ dimensions. The resulting capacity, in number of tasks:
+$$n_{\max} = \left\lfloor \frac{d}{r(1-\tau)} \right\rfloor$$
+With d=512, r=8:
+- $\tau$ = 0.995 → capacity = 512/(8×0.005) = **12,800 tasks** (ample headroom)
+- $\tau$ = 0.980 → capacity = 512/(8×0.020) = **3,200 tasks** (still ample, but more aggressive)
+A threshold of 0.995 preserves more capacity for later tasks.
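The arithmetic of the bound is easy to script. This sketch evaluates the formula exactly with `fractions.Fraction`, because binary floating point does not represent values such as 0.005 exactly and a float-based floor may land one below the intended integer. The function name `gpm_capacity` is hypothetical.

```python
import math
from fractions import Fraction


def gpm_capacity(d, r, threshold):
    """n_max = floor(d / (r * (1 - threshold))), evaluated in exact arithmetic.
    `threshold` must be below 1, otherwise the bound is undefined."""
    tau = Fraction(str(threshold))  # parse the decimal literal exactly
    return math.floor(Fraction(d) / (r * (1 - tau)))


for tau in (0.995, 0.980):
    print(f"threshold={tau}: capacity={gpm_capacity(512, 8, tau)} tasks")
```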
+---
+## 4. Fix Plan: SpecRoute V3
+### Fix 1: Adaptive Training Bias (Required)
+```
+β = T · ln(α · (n_old) / (1-α))    when n_old ≥ 1
+β = 0                                when n_old = 0
+```
+- `n_old = len(self.spectral_signatures)`: taken automatically from the number of loaded signatures (this is $n-1$ in Section 3.1)
+- `α = target_routing_alpha`: config parameter, default 0.8
+- Keeps w_cur ≈ 80% regardless of the number of tasks
+### Fix 2: Symmetric Inference Routing (Required)
+- **Training**: Keep the A-row fit + adaptive bias (cold-start compatible)
+- **Inference**: Compute SVD(B@A) for the current task → use the SVD fit for ALL tasks
+- Method: `prepare_inference_routing()`, called once before inference
+- Removes the A-row vs SVD asymmetry entirely
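One way the SVD signature of a trained LoRA pair could be obtained is sketched below with NumPy, using two thin QR factorisations followed by an SVD of an `r × r` matrix (the changelog mentions a thin QR+SVD for V2.1). The internals of `prepare_inference_routing()` are not shown in this document, so `thin_svd_signature`, `svd_fit`, and the shapes `B: (d_out, r)`, `A: (r, d_in)` are assumptions made for illustration only.

```python
import numpy as np


def thin_svd_signature(B, A):
    """Right singular vectors (as rows) and singular values of B @ A,
    computed without forming the full (d_out, d_in) product."""
    Q_b, R_b = np.linalg.qr(B)      # B   = Q_b @ R_b, with R_b of shape (r, r)
    Q_a, R_a = np.linalg.qr(A.T)    # A.T = Q_a @ R_a, with R_a of shape (r, r)
    # B @ A = Q_b @ (R_b @ R_a.T) @ Q_a.T, so only an (r, r) SVD is needed.
    _, sigma, Wh = np.linalg.svd(R_b @ R_a.T)
    V = Wh @ Q_a.T                  # rows are right singular vectors, shape (r, d_in)
    return V, sigma


def svd_fit(V, sigma, h):
    """sigma^2-weighted Rayleigh quotient of input h against one task signature."""
    w = sigma ** 2
    return float(np.sum(w * (V @ h) ** 2) / (np.sum(w) * (h @ h)))


rng = np.random.default_rng(0)
B = rng.standard_normal((16, 4))    # toy LoRA up-projection after training
A = rng.standard_normal((4, 32))    # toy LoRA down-projection
V, sigma = thin_svd_signature(B, A)
h = rng.standard_normal(32)
print(sigma.shape, V.shape, svd_fit(V, sigma, h))
```

Once such a signature exists for the current task, it can be scored with the same `svd_fit` as every old task, which is the symmetry this fix aims for.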
+### Fix 3: Threshold = 0.995 (Match ROOT)
+- Changed only in the shell script
+- Reduces null-space consumption per task
+- Preserves capacity for later tasks
+### Nothing else is added
+- No KL replay (violates the zero-replay setting)
+- No learned routing parameters (would lose the parameter-free novelty)
+- No changes to optimizer/lr/scheduler
+- Respects the principle: "improve the implementation only, do not over-engineer"
+---
+## 5. Specific Code Changes
+### File 1: `t5_specroute.py`
+1. Add the `prepare_inference_routing()` method to T5Stack
+2. Modify `compute_spectral_routing()`:
+   - Training: A-row fit + `adaptive_training_bias` (computed from α and n_old)
+   - Inference: SVD fit from `_current_task_svd` (precomputed)
+3. Add the `adaptive_training_bias` property
+### File 2: `run_t5.py`
+1. Add `target_routing_alpha` to prompt_config
+2. Call `model.encoder.prepare_inference_routing()` before inference
+### File 3: `gen_script_long_order3_t5_small_specroute_v3.sh`
+1. `--threshold 0.995`
+2. `--transthreshold 0.995`
+3. `--target_routing_alpha 0.8`
+4. Output dir: `specroute_v3`