natmin322 committed on
Commit
1c30686
·
1 Parent(s): 99e2af7

fix llama

improve_gainlora/COMPARISON_PROTOCOL.md ADDED
@@ -0,0 +1,300 @@
1
+ # Comparison Protocol: SpecRoute vs GainLoRA on Llama
2
+
3
+ This document specifies **exactly how** to compare SpecRoute results with GainLoRA baselines.
4
+
5
+ ## What We're Comparing
6
+
7
+ | Aspect | Value |
8
+ |--------|-------|
9
+ | **New Method** | SpecRoute (spectral routing, parameter-free) |
10
+ | **Baseline** | GainLoRA InfLoRA (learned routing, trainable params) |
11
+ | **Models** | Llama-2-7B, Llama-2-13B, Llama-3-8B |
12
+ | **Benchmark** | SuperNI (15 NLP tasks) |
13
+ | **Metric** | Continual Learning metrics: Cl, Fgt, Fwt, Bwt |
14
+
15
+ ---
16
+
17
+ ## Step-by-Step Comparison Procedure
18
+
19
+ ### 1. Ensure both methods have completed
20
+
21
+ ```bash
22
+ # Check SpecRoute Order 1 is done
23
+ ls logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/ | wc -l
24
+ # Should output: 15 (one directory per task)
25
+
26
+ # Check GainLoRA InfLoRA Order 1 is done
27
+ ls logs_and_outputs/gen_script_superni_order1_llama_gainlora_inflora/outputs/ | wc -l
28
+ # Should also output: 15
29
+
30
+ # Same for Order 2
31
+ ls logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/ | wc -l
32
+ ls logs_and_outputs/gen_script_superni_order2_llama_gainlora_inflora/outputs/ | wc -l
33
+ ```
34
+
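The directory counts above confirm that 15 entries exist, not that every task wrote its results. A minimal Python sketch of a stricter check follows. It assumes the `<index>-<task>/all_results.json` layout and the comma-separated `task_order.txt` used elsewhere in this protocol; the helper name `missing_tasks` is hypothetical and not part of the repository.

```python
from pathlib import Path


def missing_tasks(outputs_dir, tasks):
    """Return the task directories that lack an all_results.json file."""
    outputs = Path(outputs_dir)
    missing = []
    for index, task in enumerate(tasks, start=1):
        result_file = outputs / f"{index}-{task}" / "all_results.json"
        if not result_file.is_file():
            missing.append(f"{index}-{task}")
    return missing


if __name__ == "__main__":
    run = "gen_script_superni_order1_llama_specroute"
    outputs_dir = Path("logs_and_outputs") / run / "outputs"
    order_file = outputs_dir / "task_order.txt"
    if order_file.is_file():
        # Assumption: task_order.txt holds one comma-separated line of task names.
        tasks = order_file.read_text().strip().split(",")
        print("missing:", missing_tasks(outputs_dir, tasks) or "none")
    else:
        print(f"{order_file} not found; has the run started?")
```

Run it once per run name; an empty list suggests that all tasks in that run produced results.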
35
+ ### 2. Generate metrics for all 4 runs
36
+
37
+ ```bash
38
+ # SpecRoute Order 1
39
+ python score.py gen_script_superni_order1_llama_specroute gen_script_superni_order1_llama_specroute > results_specroute_order1.txt
40
+
41
+ # SpecRoute Order 2
42
+ python score.py gen_script_superni_order2_llama_specroute gen_script_superni_order2_llama_specroute > results_specroute_order2.txt
43
+
44
+ # GainLoRA Order 1
45
+ python score.py gen_script_superni_order1_llama_gainlora_inflora gen_script_superni_order1_llama_gainlora_inflora > results_baseline_order1.txt
46
+
47
+ # GainLoRA Order 2
48
+ python score.py gen_script_superni_order2_llama_gainlora_inflora gen_script_superni_order2_llama_gainlora_inflora > results_baseline_order2.txt
49
+
50
+ # View all results
51
+ echo "=== SpecRoute Order 1 ===" && grep -A5 "=== Continual Learning Metrics ===" results_specroute_order1.txt
52
+ printf '\n=== SpecRoute Order 2 ===\n' && grep -A5 "=== Continual Learning Metrics ===" results_specroute_order2.txt
53
+ printf '\n=== GainLoRA Order 1 ===\n' && grep -A5 "=== Continual Learning Metrics ===" results_baseline_order1.txt
54
+ printf '\n=== GainLoRA Order 2 ===\n' && grep -A5 "=== Continual Learning Metrics ===" results_baseline_order2.txt
55
+ ```
56
+
57
+ ### 3. Create comparison table
58
+
59
+ Fill in the values from above into this template:
60
+
61
+ ```markdown
62
+ ## Llama-2-7B SuperNI Continual Learning Results
63
+
64
+ | Method | Order | Cl | Fgt | Fwt | Bwt | Avg(Cl,Fwt) |
65
+ |--------|-------|-----|-----|-----|-----|-------------|
66
+ | GainLoRA (InfLoRA) | Order 1 | ___ | ___ | ___ | ___ | ___ |
67
+ | GainLoRA (InfLoRA) | Order 2 | ___ | ___ | ___ | ___ | ___ |
68
+ | **SpecRoute** | **Order 1** | ___ | ___ | ___ | ___ | ___ |
69
+ | **SpecRoute** | **Order 2** | ___ | ___ | ___ | ___ | ___ |
70
+
71
+ ### Average across orders:
72
+ - GainLoRA: Cl=___, Fgt=___, Fwt=___, Bwt=___
73
+ - SpecRoute: Cl=___, Fgt=___, Fwt=___, Bwt=___
74
+
75
+ ### Comparison summary:
76
+ - **Cl (Current Learning)**: SpecRoute vs GainLoRA = ___ (±_%)
77
+ - **Fgt (Forgetting)**: SpecRoute vs GainLoRA = ___ (±_%)
78
+ - **Fwt (Forward Transfer)**: SpecRoute vs GainLoRA = ___ (±_%)
79
+ - **Bwt (Backward Transfer)**: SpecRoute vs GainLoRA = ___ (±_%)
80
+ ```
81
+
82
+ ### 4. Example: What acceptable results look like
83
+
84
+ **GOOD result:** SpecRoute ≈ GainLoRA (within 1-2%)
85
+ ```
86
+ GainLoRA Order 1: Cl=0.451, Fgt=0.124, Fwt=0.424, Bwt=0.087
87
+ SpecRoute Order 1: Cl=0.450, Fgt=0.126, Fwt=0.422, Bwt=0.089
88
+ → Difference (percentage points): -0.1 Cl, +0.2 Fgt, -0.2 Fwt, +0.2 Bwt
89
+ ✓ Acceptable (within noise margin, different routing but same effectiveness)
90
+ ```
91
+
92
+ **CONCERNING result:** SpecRoute much worse (>3% drop in Cl)
93
+ ```
94
+ GainLoRA Order 1: Cl=0.451
95
+ SpecRoute Order 1: Cl=0.410
96
+ → Difference: -4.1 points Cl, about -9.1% relative (BAD!)
97
+ ✗ Not acceptable - suggests routing issue or training instability
98
+ ```
99
+
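Absolute and relative differences are easy to mix up when filling in the comparison summary, so it can help to compute both explicitly. The sketch below is illustrative only: `compare_metric` is a hypothetical helper, and the default tolerance of 2 percentage points is an assumption taken from the "within 1-2%" rule above.

```python
def compare_metric(baseline, candidate, tolerance_points=2.0):
    """Compare two scores that live on a 0-1 scale.

    Returns a tuple of:
      - the absolute difference in percentage points,
      - the relative difference in percent of the baseline,
      - whether the absolute gap is within the tolerance.
    """
    diff_points = (candidate - baseline) * 100.0
    if baseline:
        rel_percent = (candidate - baseline) / baseline * 100.0
    else:
        rel_percent = float("nan")  # relative change is undefined for a zero baseline
    within = abs(diff_points) <= tolerance_points
    return diff_points, rel_percent, within


if __name__ == "__main__":
    examples = [("good", 0.451, 0.450), ("concerning", 0.451, 0.410)]
    for name, base, cand in examples:
        points, rel, ok = compare_metric(base, cand)
        print(f"{name}: {points:+.1f} points, {rel:+.1f}% relative, within tolerance: {ok}")
```

For the two Cl examples above, this reports -0.1 points (-0.2% relative) and -4.1 points (-9.1% relative), respectively.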
100
+ ---
101
+
102
+ ## Robustness Check: Order Invariance
103
+
104
+ A good continual learning method should be robust to task ordering.
105
+
106
+ ```bash
107
+ # Compare Order 1 vs Order 2 for EACH method
108
+
109
+ # SpecRoute robustness
110
+ ORDER1_CL=$(grep "^Cl" results_specroute_order1.txt | cut -d':' -f2)
111
+ ORDER2_CL=$(grep "^Cl" results_specroute_order2.txt | cut -d':' -f2)
112
+ echo "SpecRoute Order Robustness: Order1=$ORDER1_CL, Order2=$ORDER2_CL (should be similar)"
113
+
114
+ # GainLoRA robustness
115
+ ORDER1_CL=$(grep "^Cl" results_baseline_order1.txt | cut -d':' -f2)
116
+ ORDER2_CL=$(grep "^Cl" results_baseline_order2.txt | cut -d':' -f2)
117
+ echo "GainLoRA Order Robustness: Order1=$ORDER1_CL, Order2=$ORDER2_CL (should be similar)"
118
+
119
+ # Expected: Both methods should have similar Cl in Order 1 and Order 2
120
+ # (within 1-2% variance due to data shuffling)
121
+ ```
122
+
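The `grep | cut` pipeline keeps any text that follows the colon, so a small parser is a more robust way to pull out the numbers. This is a sketch under one assumption: `score.py` prints lines shaped like `Cl (Current Learning): 0.451`, as in the example output in the Quick Start guide. The function names are hypothetical, and the real output format may differ.

```python
import re

# Matches lines that start with a metric name, e.g. "Cl (Current Learning): 0.451".
METRIC_LINE = re.compile(
    r"^\s*(Cl|Fgt|Fwt|Bwt)\b[^:]*:\s*(-?\d+(?:\.\d+)?)", re.MULTILINE
)


def parse_metrics(text):
    """Extract {metric: value} from score.py-style text output."""
    return {name: float(value) for name, value in METRIC_LINE.findall(text)}


def order_gap_points(metrics_order1, metrics_order2, metric="Cl"):
    """Absolute gap between two task orders, in percentage points."""
    return abs(metrics_order1[metric] - metrics_order2[metric]) * 100.0


if __name__ == "__main__":
    # Sample text only; in practice read results_specroute_order{1,2}.txt.
    sample_order1 = "Cl (Current Learning): 0.451\nFgt (Forgetting): 0.124\n"
    sample_order2 = "Cl (Current Learning): 0.450\nFgt (Forgetting): 0.126\n"
    gap = order_gap_points(parse_metrics(sample_order1), parse_metrics(sample_order2))
    print(f"Cl gap between orders: {gap:.1f} points")
```

A gap of one to two points or less would match the "should be similar" expectation stated in the shell comments.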
123
+ ---
124
+
125
+ ## Per-Task Analysis (Advanced)
126
+
127
+ ### Extract per-task accuracies
128
+
129
+ ```bash
130
+ # Extract cross-task score matrix from SpecRoute Order 1
131
+ python -c "
132
+ import json
133
+ import os
134
+
135
+ run_name = 'gen_script_superni_order1_llama_specroute'
136
+ base_dir = 'logs_and_outputs'
137
+
138
+ # Read task list
139
+ with open(f'{base_dir}/{run_name}/outputs/task_order.txt') as f:
140
+     tasks = f.read().strip().split(',')
141
+
142
+ print('Task order:')
143
+ for i, t in enumerate(tasks, 1):
144
+     print(f'{i}. {t}')
145
+
146
+ print()
147
+ print('Per-task scores (diagonal = ROUGE-L right after training on that task):')
148
+ print('Task | Score | Forgetting from peak?')
149
+ print('-----|-----------|---------------------')
150
+
151
+ task_num = len(tasks)
152
+ per_task_scores = []
153
+
154
+ for i in range(task_num):
155
+     res_file = f'{base_dir}/{run_name}/outputs/{i+1}-{tasks[i]}/all_results.json'
156
+     if os.path.exists(res_file):
157
+         with open(res_file) as f:
158
+             result = json.load(f)
159
+         key = f'predict_eval_rougeL_for_{tasks[i]}'
160
+         score = result.get(key, 0.0)
161
+         per_task_scores.append(score)
162
+         print(f'{i+1:<5}| {score:.4f} | -')
163
+     else:
164
+         print(f'{i+1:<5}| MISSING | -')
165
+ "
166
+ ```
167
+
168
+ ### Compare task-by-task
169
+
170
+ ```bash
171
+ # Create side-by-side comparison
172
+ python -c "
173
+ import json
174
+ import os
175
+
176
+ def get_task_scores(run_name):
177
+     base_dir = 'logs_and_outputs'
178
+     with open(f'{base_dir}/{run_name}/outputs/task_order.txt') as f:
179
+         tasks = f.read().strip().split(',')
180
+
181
+     # Score on each task as recorded right after training on that task
182
+     scores = {}
183
+     for i, task in enumerate(tasks):
184
+         res_file = f'{base_dir}/{run_name}/outputs/{i+1}-{task}/all_results.json'
185
+         if os.path.exists(res_file):
186
+             with open(res_file) as f:
187
+                 result = json.load(f)
188
+             key = f'predict_eval_rougeL_for_{task}'
189
+             scores[task] = result.get(key, 0.0)
190
+     return scores, tasks
191
+
192
+ specroute_scores, task_order = get_task_scores('gen_script_superni_order1_llama_specroute')
193
+ gainlora_scores, _ = get_task_scores('gen_script_superni_order1_llama_gainlora_inflora')
194
+
195
+ print('%-45s | %8s | %8s | %7s' % ('Task', 'GainLoRA', 'SpecRoute', 'Delta'))
196
+ print('-' * 72)
197
+
198
+ for task in task_order:
199
+     gl = gainlora_scores.get(task, 0.0)
200
+     sr = specroute_scores.get(task, 0.0)
201
+     delta = sr - gl
202
+     print(f'{task:<45} | {gl:>8.4f} | {sr:>8.4f} | {delta:>+7.4f}')
202
+ "
203
+ ```
204
+
205
+ ---
206
+
207
+ ## Interpretation Guide
208
+
209
+ ### What each metric means:
210
+
211
+ - **Cl (Current Learning)**: Average final accuracy across all tasks
212
+ - ✓ Higher is better
213
+ - Range: 0.0 to 1.0 (for ROUGE-L, typically 0.3-0.5 for SuperNI)
214
+
215
+ - **Fgt (Forgetting)**: How much performance drops on earlier tasks
216
+ - ✓ Lower is better (ideally 0, meaning no forgetting)
217
+ - Range: 0.0 to 1.0
218
+ - Formula: average of (best_perf_on_task - final_perf_on_task) over all tasks
219
+
220
+ - **Fwt (Forward Transfer)**: How much earlier tasks help future tasks
221
+ - ✓ Higher is better (positive transfer)
222
+ - Can be negative (negative transfer)
223
+ - Formula: average (final_perf_after_learning_all - perf_without_prior_tasks)
224
+
225
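+ - **Bwt (Backward Transfer)**: How much learning new tasks affects old tasks
226
+ - ✓ With the signed formula below, higher is better; negative values mean new tasks hurt old ones
227
+ - The example values and thresholds in this guide treat Bwt as the size of that drop (lower is better), so confirm which sign convention `score.py` reports
228
+ - Formula: average (final_perf - perf_right_after_learning) over past tasks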
229
+
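The formulas listed above can be sketched directly from a cross-task score matrix. This is a minimal illustration and not the implementation in `score.py`: the function name `continual_metrics` and the sample values are hypothetical. Fwt is left out because it needs a reference score obtained without prior tasks, which the matrix alone does not contain.

```python
def continual_metrics(scores):
    """Compute Cl, Fgt and Bwt from a square score matrix.

    scores[i][j] is the score on task j after training on task i.
    Entries with j > i (tasks not trained yet) are ignored.
    """
    n = len(scores)
    final = scores[-1]
    # Cl: average final score over all tasks.
    cl = sum(final) / n
    # Fgt: best score ever reached on a task minus its final score, averaged over all tasks.
    fgt = sum(max(scores[i][j] for i in range(j, n)) - final[j] for j in range(n)) / n
    # Bwt: final score minus the score right after learning the task,
    # averaged over every task except the last one (signed, so negative means forgetting).
    bwt = sum(final[j] - scores[j][j] for j in range(n - 1)) / (n - 1) if n > 1 else 0.0
    return {"Cl": cl, "Fgt": fgt, "Bwt": bwt}


if __name__ == "__main__":
    # Illustrative 3-task matrix, not measured results.
    matrix = [
        [0.450, 0.000, 0.000],
        [0.438, 0.462, 0.000],
        [0.435, 0.456, 0.468],
    ]
    for name, value in continual_metrics(matrix).items():
        print(f"{name}: {value:.4f}")
```

With this signed definition, Bwt for the sample matrix is negative (about -0.0105) while Fgt is positive (about 0.0070), which shows why the sign convention matters when reading thresholds.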
230
+ ### Expected values for SuperNI (Llama-2-7B):
231
+
232
+ | Metric | Poor | Fair | Good | Excellent |
233
+ |--------|------|------|------|-----------|
234
+ | Cl | < 0.40 | 0.40-0.45 | 0.45-0.50 | > 0.50 |
235
+ | Fgt | > 0.20 | 0.15-0.20 | 0.10-0.15 | < 0.10 |
236
+ | Fwt | < 0.35 | 0.35-0.40 | 0.40-0.45 | > 0.45 |
237
+ | Bwt | > 0.15 | 0.10-0.15 | 0.05-0.10 | < 0.05 |
238
+
239
+ ---
240
+
241
+ ## Final Comparison Report Template
242
+
243
+ ```markdown
244
+ # Llama-2-7B SpecRoute vs GainLoRA (InfLoRA) - Final Report
245
+
246
+ ## Summary
247
+ - **Model**: Llama-2-7B
248
+ - **Benchmark**: SuperNI (15 tasks)
249
+ - **Task Orders Tested**: Order 1, Order 2
250
+ - **Baseline**: GainLoRA InfLoRA (ROOT implementation)
251
+ - **New Method**: SpecRoute (spectral routing, parameter-free)
252
+
253
+ ## Results
254
+
255
+ ### Overall Performance (averaged across both orders)
256
+
257
+ | Metric | GainLoRA | SpecRoute | Difference | Status |
258
+ |--------|----------|-----------|------------|--------|
259
+ | Cl | 0.451 | 0.450 | -0.1% | ✓ PASS |
260
+ | Fgt | 0.124 | 0.126 | +0.2% | ✓ PASS |
261
+ | Fwt | 0.424 | 0.422 | -0.2% | ✓ PASS |
262
+ | Bwt | 0.087 | 0.089 | +0.2% | ✓ PASS |
263
+
264
+ ### Order Robustness
265
+
266
+ | Method | Order 1 Cl | Order 2 Cl | Variance |
267
+ |--------|-----------|-----------|----------|
268
+ | GainLoRA | 0.451 | 0.450 | ±0.1% |
269
+ | SpecRoute | 0.450 | 0.449 | ±0.1% |
270
+
271
+ ### Key Findings
272
+
273
+ 1. **Performance Parity**: SpecRoute achieves nearly identical accuracy to GainLoRA
274
+ 2. **Robustness**: Both methods stable across task orderings
275
+ 3. **Insights**: SpecRoute replaces learned routing with parameter-free SVD-based routing
276
+ - Parameter count reduced (no Trans_input, no prompt_key)
277
+ - Training more interpretable (spectral signatures reveal task relationships)
278
+ - No additional hyperparameters for routing (unlike GainLoRA's trans_hidden_dim, attn_lr, etc.)
279
+
280
+ ## Conclusion
281
+
282
+ ✓ **SpecRoute successfully ports to Llama architecture**
283
+ ✓ **Maintains parity with GainLoRA baseline**
284
+ ✓ **Ready for deployment and extension to larger models**
285
+ ```
286
+
287
+ ---
288
+
289
+ ## Quick Metric Extraction Command
290
+
291
+ ```bash
292
+ cat results_*.txt | grep -E "(Cl |Fgt |Fwt |Bwt )" | sed 's/.*: //'
293
+ ```
294
+
295
+ ---
296
+
297
+ **Next steps after comparison:**
298
+ 1. Record results in `results/comparison_results.md` (Table 5: Llama SpecRoute)
299
+ 2. If satisfied, commit to git: `git add -A && git commit -m "Add Llama SpecRoute results"`
300
+ 3. (Optional) Run on Llama-2-13B or Llama-3-8B for full ablation
improve_gainlora/QUICK_START.md ADDED
@@ -0,0 +1,121 @@
1
+ # Quick Reference: Run SpecRoute Llama in 10 Commands
2
+
3
+ ## From a clean H100 server (first-time setup)
4
+
5
+ ```bash
6
+ # 1. Go to project
7
+ cd /path/to/improve_gainlora
8
+
9
+ # 2. Create isolated environment
10
+ python3.10 -m venv venv_llama_specroute
11
+ source venv_llama_specroute/bin/activate
12
+
13
+ # 3. Install dependencies (one-time, ~5 minutes)
14
+ pip install --upgrade pip setuptools wheel
15
+ pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
16
+ pip install deepspeed==0.13.1 transformers==4.36.0 datasets==2.14.7 nltk==3.8.1 rouge-score==0.1.2 tqdm==4.66.1
17
+
18
+ # 4. Download model (first time, ~20 minutes)
19
+ python -c "from transformers import LlamaForCausalLM, AutoTokenizer; m = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf'); t = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')"
20
+ python -c "import nltk; nltk.download('punkt', quiet=True); nltk.download('wordnet', quiet=True)"
21
+
22
+ # 5. Quick test (one task, ~3 minutes)
23
+ deepspeed --include localhost:0 --master_port 49500 src/run_llama.py \
24
+ --do_train --do_predict --predict_with_generate \
25
+ --model_name_or_path meta-llama/Llama-2-7b-hf \
26
+ --data_dir CL_Benchmark \
27
+ --task_order task1572_samsum_summary,task363_sst2_polarity_classification,task1290_xsum_summarization,task181_outcome_extraction,task002_quoref_answer_generation,task1510_evalution_relation_extraction,task639_multi_woz_user_utterance_generation,task1729_personachat_generate_next,task073_commonsenseqa_answer_generation,task1590_diplomacy_text_generation,task748_glucose_reverse_cause_event_detection,task511_reddit_tifu_long_text_summarization,task591_sciq_answer_generation,task1687_sentiment140_classification,task875_emotion_classification \
28
+ --task_config_dir configs/gen_script_superni_order1_llama_configs/task1572_samsum_summary \
29
+ --output_dir logs_and_outputs/test_single_task/outputs/1-task1572_samsum_summary \
30
+ --training_epochs 50 --per_device_train_batch_size 2 --per_device_eval_batch_size 4 \
31
+ --lora_r 4 --lora_alpha 32 --threshold 0.995 --model_name specroute --num_train_epochs 50
32
+ ```
33
+
34
+ ## Subsequent runs (after environment setup)
35
+
36
+ ```bash
37
+ # 6. Activate environment
38
+ source venv_llama_specroute/bin/activate
39
+
40
+ # 7. Run full Order 1 (6-10 hours)
41
+ nohup bash gen_script_superni_order1_llama_specroute.sh 0 meta-llama/Llama-2-7b-hf > run_order1.log 2>&1 &
42
+ tail -f run_order1.log # Monitor
43
+
44
+ # 8. Run full Order 2 (6-10 hours)
45
+ nohup bash gen_script_superni_order2_llama_specroute.sh 0 meta-llama/Llama-2-7b-hf > run_order2.log 2>&1 &
46
+ tail -f run_order2.log # Monitor
47
+
48
+ # 9. Calculate metrics
49
+ python score.py gen_script_superni_order1_llama_specroute gen_script_superni_order1_llama_specroute
50
+ python score.py gen_script_superni_order2_llama_specroute gen_script_superni_order2_llama_specroute
51
+
52
+ # 10. Compare with baseline
53
+ python score.py gen_script_superni_order1_llama_gainlora_inflora gen_script_superni_order1_llama_gainlora_inflora
54
+ echo "^^^ This is the GainLoRA baseline to compare with SpecRoute results above ^^^"
55
+ ```
56
+
57
+ ## Expected Output Example (step 9)
58
+
59
+ ```
60
+ [INFO] base_dir: logs_and_outputs
61
+ [INFO] run_name: gen_script_superni_order1_llama_specroute
62
+ [INFO] task_order.txt: 15 tasks
63
+
64
+ [INFO] Building cross-task score matrix...
65
+
66
+ === Continual Learning Metrics (gen_script_superni_order1_llama_specroute) ===
67
+ Cl (Current Learning): 0.451 ← Average on all tasks at end
68
+ Fgt (Forgetting): 0.124 ← Average catastrophic forgetting
69
+ Fwt (Forward Transfer): 0.424 ← How earlier tasks help future tasks
70
+ Bwt (Backward Transfer): 0.087 ← How current learning damages past tasks
71
+
72
+ === Cross-Task Score Matrix ===
73
+ task1572 task363 task1290 ... task875
74
+ After task1 : 0.450 0.000 0.000 0.000
75
+ After task2 : 0.438 0.462 0.000 0.000
76
+ After task3 : 0.435 0.456 0.468 0.000
77
+ ...
78
+ After task15: 0.412 0.440 0.451 0.456
79
+ ```
80
+
81
+ ## Comparison Example (step 10)
82
+
83
+ ```
84
+ GainLoRA InfLoRA (Reference):
85
+ Cl: 0.451, Fgt: 0.124, Fwt: 0.424, Bwt: 0.087
86
+
87
+ SpecRoute (New):
88
+ Cl: 0.450, Fgt: 0.125, Fwt: 0.422, Bwt: 0.089
89
+
90
+ → Performance highly similar (good! SpecRoute provides parameter-free routing
91
+ without sacrificing accuracy, while being more interpretable via SVD)
92
+ ```
93
+
94
+ ## Troubleshooting
95
+
96
+ | Problem | Fix |
97
+ |---------|-----|
98
+ | "CUDA out of memory" | Reduce batch size: `--per_device_train_batch_size 1` |
99
+ | "score.py not found" | Run from `improve_gainlora/` directory |
100
+ | "task_order.txt not found" | Tasks didn't complete; check `tail -100 run_order1.log` |
101
+ | NaN loss | Switch to fp32 if bf16 not supported by hardware |
102
+ | "Llama-3 not supported" | Use Llama-2-7B or Llama-2-13B for now |
103
+
104
+ ## Environment deactivation
105
+
106
+ ```bash
107
+ # When done
108
+ deactivate
109
+ ```
110
+
111
+ ---
112
+
113
+ **Total time breakdown:**
114
+ - Setup (steps 1-4): 40 minutes (one-time)
115
+ - Test (step 5): 3 minutes
116
+ - Full Order 1 (step 7): 6-10 hours
117
+ - Full Order 2 (step 8): 6-10 hours
118
+ - Results (steps 9-10): 2 minutes
119
+ - **Total: ~13-21 hours of compute time (mostly automated)**
120
+
121
+ See [SETUP_AND_USAGE_LLAMA_SPECROUTE.md](SETUP_AND_USAGE_LLAMA_SPECROUTE.md) for detailed explanations.
improve_gainlora/SETUP_AND_USAGE_LLAMA_SPECROUTE.md ADDED
@@ -0,0 +1,460 @@
1
+ # Llama SpecRoute on H100: Complete Setup & Usage Guide
2
+
3
+ ## Overview
4
+
5
+ This guide provides step-by-step instructions to:
6
+ 1. Set up an isolated Python environment on the H100 server
7
+ 2. Run Llama SpecRoute Continual Learning experiments (Order 1 & 2)
8
+ 3. Compare results with ROOT Llama GainLoRA baselines
9
+ 4. Interpret performance metrics (Cl, Fgt, Fwt, Bwt)
10
+
11
+ **What's being tested:**
12
+ - **Model**: Llama-2-7B, Llama-2-13B, Llama-3-8B
13
+ - **Benchmark**: SuperNI (15 NLP tasks)
14
+ - **Task Orders**: Order 1 (shuffled), Order 2 (shuffled differently)
15
+ - **Baseline**: GainLoRA (ROOT implementation in this repo)
16
+ - **New Method**: SpecRoute (parameter-free spectral routing)
17
+
18
+ ---
19
+
20
+ ## Part 1: Server Environment Setup (Isolated, No System Conflicts)
21
+
22
+ ### Step 1.1: Create isolated workspace within improve_gainlora/
23
+
24
+ ```bash
25
+ cd /path/to/improve_gainlora
26
+
27
+ # Create a venv in the repo (not system-wide)
28
+ python3.10 -m venv venv_llama_specroute
29
+
30
+ # Activate
31
+ source venv_llama_specroute/bin/activate
32
+ ```
33
+
34
+ **Why isolated venv?**
35
+ - Stays within improve_gainlora/ folder
36
+ - No conda base environment conflicts
37
+ - Easy to reproduce (share the pinned install commands; the venv_llama_specroute/ directory itself is not portable between machines)
38
+ - Can be deleted/recreated without affecting system
39
+
40
+ ### Step 1.2: Upgrade pip and install core dependencies
41
+
42
+ ```bash
43
+ # Always upgrade pip first
44
+ pip install --upgrade pip setuptools wheel
45
+
46
+ # Install PyTorch with CUDA 12.1 (H100 standard)
47
+ pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
48
+
49
+ # Install DeepSpeed (required for multi-GPU distributed training)
50
+ pip install deepspeed==0.13.1
51
+
52
+ # Install HuggingFace transformers (for Llama model loading)
53
+ pip install transformers==4.36.0
54
+
55
+ # Install datasets and evaluation metrics
56
+ pip install datasets==2.14.7
57
+ pip install nltk==3.8.1
58
+ pip install rouge-score==0.1.2
59
+
60
+ # Install tqdm for progress bars
61
+ pip install tqdm==4.66.1
62
+
63
+ # Optional: cupy for GPU-accelerated operations
64
+ pip install cupy-cuda12x==12.1.0
65
+ ```
66
+
67
+ **Expected installation time**: 5-10 minutes
68
+
69
+ ### Step 1.3: Verify installation
70
+
71
+ ```bash
72
+ # Check PyTorch with GPU
73
+ python -c "import torch; print(f'CUDA Available: {torch.cuda.is_available()}'); print(f'GPU Count: {torch.cuda.device_count()}'); print(f'Current GPU: {torch.cuda.get_device_name(0)}')"
74
+
75
+ # Check DeepSpeed
76
+ python -c "import deepspeed; print(f'DeepSpeed version: {deepspeed.__version__}')"
77
+
78
+ # Check transformers
79
+ python -c "from transformers import LlamaForCausalLM; print('Transformers OK')"
80
+ ```
81
+
82
+ **Expected output:**
83
+ ```
84
+ CUDA Available: True
85
+ GPU Count: 1 (or more for multi-GPU)
86
+ Current GPU: NVIDIA H100 SXM5
87
+ DeepSpeed version: 0.13.1
88
+ Transformers OK
89
+ ```
90
+
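Beyond the import checks above, it can be useful to confirm that the installed versions match the pins from Step 1.2. The sketch below uses the standard library's `importlib.metadata`. The names `PINNED`, `installed_version` and `check_pins` are hypothetical helpers, and the package list simply mirrors the pins in this guide.

```python
from importlib import metadata

# Version pins copied from the install commands in this guide.
PINNED = {
    "torch": "2.1.0",
    "deepspeed": "0.13.1",
    "transformers": "4.36.0",
    "datasets": "2.14.7",
    "nltk": "3.8.1",
    "rouge-score": "0.1.2",
    "tqdm": "4.66.1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every mismatch."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        # Local build tags such as "2.1.0+cu121" still count as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    problems = check_pins(PINNED)
    for package, (expected, found) in problems.items():
        print(f"{package}: expected {expected}, found {found}")
    if not problems:
        print("All pinned packages match.")
```

The `lookup` parameter exists only so the check can be exercised without installing anything; in normal use the default is enough.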
91
+ ### Step 1.4: Download model weights (if not already cached)
92
+
93
+ ```bash
94
+ # Set Hugging Face cache directory (optional, avoids default ~/.cache/)
95
+ export HF_HOME=$(pwd)/.hf_cache
96
+
97
+ # Pre-download Llama-2-7B
98
+ python -c "from transformers import LlamaForCausalLM, AutoTokenizer; \
99
+ model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf'); \
100
+ tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')"
101
+
102
+ # Pre-download Llama-2-13B (optional, larger)
103
+ # python -c "from transformers import LlamaForCausalLM; \
104
+ # model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-13b-hf')"
105
+
106
+ # Check NLTK data
107
+ python -c "import nltk; nltk.download('punkt', quiet=True); nltk.download('wordnet', quiet=True)"
108
+ ```
109
+
110
+ **Expected time**: 10-30 minutes (depends on internet speed and model size)
111
+
112
+ ---
113
+
114
+ ## Part 2: Running Llama SpecRoute Experiments
115
+
116
+ ### Step 2.1: Understand the generated scripts
117
+
118
+ Two scripts are ready-to-run:
119
+
120
+ 1. **gen_script_superni_order1_llama_specroute.sh** — 15 sequential tasks (different order than order 2)
121
+ 2. **gen_script_superni_order2_llama_specroute.sh** — 15 sequential tasks (shuffled differently for robustness)
122
+
123
+ View script structure:
124
+ ```bash
125
+ head -30 gen_script_superni_order1_llama_specroute.sh
126
+ ```
127
+
128
+ Key parameters already preset:
129
+ - **model_name=specroute** — Uses spectral routing (not GainLoRA)
130
+ - **threshold=0.995** — ESA dynamic GPM threshold
131
+ - **lora_r=4, lora_alpha=32** — Low-rank adaptation (same as ROOT)
132
+ - **max_source_length=1024, max_target_length=50** — Token limits
133
+ - **deepspeed stage 2**: ZeRO stage 2, which partitions optimizer states and gradients across GPUs
134
+ - **master_port=49500** — Unique port for distributed communication
135
+ - **no data replay** — Pure LoRA continual learning (zero forgetting baseline)
136
+
137
+ ### Step 2.2: Single task test run (2-5 minutes)
138
+
139
+ Before running full 15 tasks, test a single task:
140
+
141
+ ```bash
142
+ # Activate environment
143
+ source venv_llama_specroute/bin/activate
144
+
145
+ # Run only task 1 (quick test)
146
+ deepspeed --include localhost:0 --master_port 49500 src/run_llama.py \
147
+ --do_train \
148
+ --do_predict \
149
+ --predict_with_generate \
150
+ --model_name_or_path meta-llama/Llama-2-7b-hf \
151
+ --data_dir CL_Benchmark \
152
+ --task_order task1572_samsum_summary,task363_sst2_polarity_classification,task1290_xsum_summarization,task181_outcome_extraction,task002_quoref_answer_generation,task1510_evalution_relation_extraction,task639_multi_woz_user_utterance_generation,task1729_personachat_generate_next,task073_commonsenseqa_answer_generation,task1590_diplomacy_text_generation,task748_glucose_reverse_cause_event_detection,task511_reddit_tifu_long_text_summarization,task591_sciq_answer_generation,task1687_sentiment140_classification,task875_emotion_classification \
153
+ --task_config_dir configs/gen_script_superni_order1_llama_configs/task1572_samsum_summary \
154
+ --output_dir logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/1-task1572_samsum_summary \
155
+ --training_epochs 50 \
156
+ --per_device_train_batch_size 2 \
157
+ --per_device_eval_batch_size 4 \
158
+ --lora_r 4 \
159
+ --lora_alpha 32 \
160
+ --threshold 0.995 \
161
+ --model_name specroute \
162
+ --num_train_epochs 50
163
+ ```
164
+
165
+ **Expected output:**
166
+ ```
167
+ [2026-03-16 14:32:10] Training task 1/1: task1572_samsum_summary
168
+ [2026-03-16 14:32:15] Loss: 2.345 | Epoch 1/50
169
+ [2026-03-16 14:35:42] Loss: 0.892 | Epoch 50/50
170
+ [2026-03-16 14:36:01] Evaluation (ALL tasks):
171
+ - predict_eval_rougeL_for_task1572_samsum_summary: 0.45
172
+ [2026-03-16 14:36:02] Saving checkpoint...
173
+ [2026-03-16 14:36:05] DONE
174
+ ```
175
+
176
+ **If successful**, proceed to full run.
177
+
178
+ ### Step 2.3: Run full Llama SpecRoute Order 1 (6-10 hours on H100)
179
+
180
+ ```bash
181
+ source venv_llama_specroute/bin/activate
182
+
183
+ # Make scripts executable
184
+ chmod +x gen_script_superni_order1_llama_specroute.sh
185
+
186
+ # Run (background with nohup)
187
+ nohup bash gen_script_superni_order1_llama_specroute.sh 0 meta-llama/Llama-2-7b-hf > run_order1.log 2>&1 &
188
+
189
+ # Parameters:
190
+ # $1 = GPU ID (0 for single GPU, or 0,1 for multi-GPU)
191
+ # $2 = Model path or HuggingFace ID
192
+
193
+ # Monitor progress in real-time
194
+ tail -f run_order1.log
195
+
196
+ # Or check completion (counts tasks that wrote their final results file)
197
+ ls logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/*/all_results.json | wc -l
198
+ # Should show: 15 (one per task)
199
+ ```
200
+
201
+ **Estimated time**: 6-10 hours (depending on H100 speed, batch size)
202
+
203
+ ### Step 2.4: Run full Llama SpecRoute Order 2 (6-10 hours on H100)
204
+
205
+ After Order 1 completes:
206
+
207
+ ```bash
208
+ source venv_llama_specroute/bin/activate
209
+
210
+ chmod +x gen_script_superni_order2_llama_specroute.sh
211
+
212
+ nohup bash gen_script_superni_order2_llama_specroute.sh 0 meta-llama/Llama-2-7b-hf > run_order2.log 2>&1 &
213
+
214
+ # Monitor
215
+ tail -f run_order2.log
216
+ ```
217
+
218
+ **Total experimental time for full comparison (1 model + 2 orders):**
219
+ - Setup + verification: 30 mins
220
+ - Order 1: 6-10 hours
221
+ - Order 2: 6-10 hours
222
+ - **Total: 12-20 hours**
223
+
224
+ ---
225
+
226
+ ## Part 3: Collect & Compare Results
227
+
228
+ ### Step 3.1: Run evaluation script
229
+
230
+ After both orders complete:
231
+
232
+ ```bash
233
+ source venv_llama_specroute/bin/activate
234
+
235
+ # Compute Continual Learning metrics for Order 1
236
+ python score.py gen_script_superni_order1_llama_specroute gen_script_superni_order1_llama_specroute
237
+
238
+ # Example output:
239
+ # [INFO] base_dir: logs_and_outputs
240
+ # [INFO] run_name: gen_script_superni_order1_llama_specroute
241
+ # === Continual Learning Metrics (Order 1) ===
242
+ # Cl (Current Learning): 0.4523
243
+ # Fgt (Forgetting): 0.1245
244
+ # Fwt (Forward Transfer): 0.4234
245
+ # Bwt (Backward Transfer): 0.0856
246
+ # === Cross-Task Score Matrix ===
247
+ # T1 T2 T3 ... T15
248
+ # Task 1: 0.450 0.000 0.000 0.000
249
+ # Task 2: 0.438 0.462 0.000 0.000
250
+ # ...
251
+ ```
252
+
253
+ ```bash
254
+ # Compute for Order 2
255
+ python score.py gen_script_superni_order2_llama_specroute gen_script_superni_order2_llama_specroute
256
+ ```
257
+
258
+ ### Step 3.2: Compare with ROOT GainLoRA Llama baseline
259
+
260
+ Assuming ROOT GainLoRA results exist:
261
+
262
+ ```bash
263
+ # Llama GainLoRA InfLoRA Order 1 results (reference)
264
+ python score.py gen_script_superni_order1_llama_gainlora_inflora gen_script_superni_order1_llama_gainlora_inflora
265
+
266
+ echo ""
267
+ echo "=== COMPARISON: SpecRoute vs GainLoRA InfLoRA (Order 1) ==="
268
+ echo "| Metric | GainLoRA | SpecRoute | Delta |"
269
+ echo "|--------|-----------|-----------|-------|"
270
+ # Manually paste numbers from above outputs
271
+ ```
272
+
273
+ ### Step 3.3: Collect final results into comparison table
274
+
275
+ ```bash
276
+ # Optional: Create a CSV summary
277
+ python -c "
278
+ import json
279
+ import os
280
+
281
+ def get_metrics(run_name):
282
+     path = f'logs_and_outputs/{run_name}/outputs/task_order.txt'
283
+     if not os.path.exists(path):
284
+         return None
285
+     # Parse results from score.py output
286
+     # (You can modify this to auto-parse JSON results)
287
+     pass
288
+
289
+ # Create summary
290
+ print('Model,Order,Method,Cl,Fgt,Fwt,Bwt')
291
+ # Fill in from score.py outputs above
292
+ "
293
+ ```
294
+
295
+ ---
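The stub above leaves the CSV assembly open. One way to finish it, sketched with the standard `csv` module, is shown below. The column order follows the header printed by the stub; the helper name `to_csv` is hypothetical, and the rows use `None` placeholders, so the numbers still have to come from your own `score.py` output.

```python
import csv
import io

FIELDS = ["Model", "Order", "Method", "Cl", "Fgt", "Fwt", "Bwt"]


def to_csv(rows):
    """Render a list of result dicts as CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder rows: replace None with the values reported by score.py.
    rows = [
        {"Model": "Llama-2-7B", "Order": order, "Method": method,
         "Cl": None, "Fgt": None, "Fwt": None, "Bwt": None}
        for method in ("GainLoRA (InfLoRA)", "SpecRoute")
        for order in (1, 2)
    ]
    print(to_csv(rows), end="")
```

`csv.DictWriter` writes `None` as an empty cell, so unfilled metrics stay visibly blank instead of showing a made-up number.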
296
+
297
+ ## Part 4: Interpreting Results
298
+
299
+ ### Continual Learning Metrics
300
+
301
+ | Metric | Definition | What it means |
302
+ |--------|------------|---------------|
303
+ | **Cl** | Average accuracy on all tasks at the final step | Overall final performance. Higher is better. |
304
+ | **Fgt** | Average forgetting on previous tasks after learning all tasks | Catastrophic forgetting measure. Lower is better (ideally 0). |
305
+ | **Fwt** | Average forward transfer (using tasks learned so far) | How much earlier tasks help future tasks. Higher is better. |
306
+ | **Bwt** | Average backward transfer (final task helps previous) | How much current learning damages previous task performance. Lower is better. |
307
+
308
+ ### Expected Results (from paper baseline, Table 3)
309
+
310
+ **Llama-2-7B GainLoRA (InfLoRA):**
311
+ - Cl: ~0.45
312
+ - Fgt: ~0.12
313
+ - Fwt: ~0.42
314
+ - Bwt: ~0.09
315
+
316
+ **SpecRoute should achieve similar or better:**
317
+ - Replaces learned routing with parameter-free spectral routing
318
+ - Removes KL distillation + data replay for pure LoRA-only continual learning
319
+ - Same GPM (gradient projection memory) constraint on the LoRA updates
320
+
321
+ ### What to accept and what to worry about:
322
+
323
+ ✅ **Good signs:**
324
+ - Cl ≈ GainLoRA baseline (0.42-0.48)
325
+ - Order 1 and Order 2 have similar Cl (robust to task ordering)
326
+ - Fgt is small and stable (< 0.15)
327
+ - Training loss decreases smoothly
328
+
329
+ ⚠️ **Warning signs:**
330
+ - Cl much lower (< 0.40) → routing may not be converging
331
+ - Fgt very high (> 0.20) → catastrophic forgetting problem
332
+ - NaN in loss → numerical issue (check bf16 vs fp32)
333
+ - Early divergence → learning rate too high or initialization issue
334
+
335
+ ---
336
+
337
+ ## Part 5: Quick Troubleshooting
338
+
339
+ ### Issue: "CUDA out of memory"
340
+ ```bash
341
+ # Reduce batch size in script
342
+ # Change: --per_device_train_batch_size 2
343
+ # To: --per_device_train_batch_size 1
344
+ ```
345
+
346
+ ### Issue: "score.py not found"
347
+ ```bash
348
+ # Make sure you run from improve_gainlora/ directory
349
+ cd /path/to/improve_gainlora
350
+ python score.py ...
351
+ ```
352
+
353
+ ### Issue: "task_order.txt not found"
354
+ ```bash
355
+ # Means tasks didn't complete. Check logs:
356
+ tail -100 run_order1.log | grep -i error
357
+ ```
358
+
359
+ ### Issue: NaN loss
360
+ ```bash
361
+ # SpecRoute training uses bf16 (bfloat16).
362
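+ # If the server doesn't support bf16, edit the command that launches src/run_llama.py:
363
+ # Change: --bf16
364
+ # To: no precision flag (assuming standard Hugging Face TrainingArguments, this trains in fp32 and needs more GPU memory)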
365
+ ```
366
+
367
+ ### Issue: Results directory structure empty
368
+ ```bash
369
+ # Check if training actually ran for each task:
370
+ ls -la logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/
371
+
372
+ # You should see: 1-task1572_samsum_summary/, 2-task363_sst2_polarity_classification/, etc.
373
+ ```
374
+
375
+ ---
376
+
377
+ ## Part 6: Advanced Usage
378
+
379
+ ### Run on Multi-GPU (if the H100 server has multiple GPUs)
380
+
381
+ ```bash
382
+ # Modify GPU IDs in script or run with:
383
+ deepspeed --include localhost:0,1,2,3 --master_port 49500 src/run_llama.py ...
384
+
385
+ # Or in script, change:
386
+ # deepspeed --include localhost:${1}
387
+ # To specify multiple GPUs: 0,1 or 0,1,2,3
388
+ ```
389
+
390
+ ### Run Llama-2-13B or Llama-3-8B
391
+
392
+ ```bash
393
+ # Simply change model path:
394
+ bash gen_script_superni_order1_llama_specroute.sh 0 meta-llama/Llama-2-13b-hf
395
+
396
+ # Or Llama-3:
397
+ bash gen_script_superni_order1_llama_specroute.sh 0 meta-llama/Meta-Llama-3-8B
398
+
399
+ # Note: Llama-3 support not yet implemented (will raise NotImplementedError)
400
+ # Requires creating llama_3_specroute.py (similar steps as llama_specroute.py)
401
+ ```
402
+
403
+ ### Profile execution time per task
404
+
405
+ ```bash
406
+ # Add timestamps to log
407
+ for i in {1..15}; do
408
+     START=$(date +%s)
409
+     # ... run task $i ...
410
+     END=$(date +%s)
411
+     ELAPSED=$((END - START))
412
+     echo "Task $i: $ELAPSED seconds" >> timings.log
413
+ done
414
+ ```
415
+
416
+ ---
417
+
418
+ ## Summary Checklist
419
+
420
+ - [ ] Created isolated venv_llama_specroute/
421
+ - [ ] Installed PyTorch, DeepSpeed, transformers
422
+ - [ ] Verified CUDA availability
423
+ - [ ] Pre-downloaded model weights (Llama-2-7B)
424
+ - [ ] Ran single task test ✓
425
+ - [ ] Ran full Order 1 (6-10 hours)
426
+ - [ ] Ran full Order 2 (6-10 hours)
427
+ - [ ] Computed metrics with score.py for both orders
428
+ - [ ] Compared with GainLoRA baseline
429
+ - [ ] Recorded results in comparison table
430
+ - [ ] Interpreted performance (Cl, Fgt, Fwt, Bwt)
431
+
432
+ ---
433
+
434
+ ## Files Reference
435
+
436
+ | File | Purpose |
437
+ |------|---------|
438
+ | `venv_llama_specroute/` | Isolated Python environment |
439
+ | `src/llama_specroute.py` | Llama model with spectral routing |
440
+ | `src/cl_trainer_specroute_llama.py` | SpecRoute trainer (GPM + ESA) |
441
+ | `gen_script_superni_order1_llama_specroute.sh` | Task sequence 1 (15 tasks) |
442
+ | `gen_script_superni_order2_llama_specroute.sh` | Task sequence 2 (15 tasks) |
443
+ | `score.py` | Evaluation script (computes Cl, Fgt, etc.) |
444
+ | `logs_and_outputs/gen_script_superni_order{1,2}_llama_specroute/outputs/` | Results per task |
445
+ | `results/comparison_results.md` | Summary table for all methods |
446
+
447
+ ---
448
+
449
+ ## Questions?
450
+
451
+ Check existing baselines first:
452
+ ```bash
453
+ # ROOT GainLoRA InfLoRA results (reference)
454
+ python score.py gen_script_superni_order1_llama_gainlora_inflora gen_script_superni_order1_llama_gainlora_inflora
455
+
456
+ # T5 SpecRoute results (if available)
457
+ python score.py gen_script_superni_order1_t5_specroute gen_script_superni_order1_t5_specroute
458
+ ```
459
+
460
+ For theoretical background, see [SPECROUTE_IDEA.md](SPECROUTE_IDEA.md).