fix llama

Files changed (3)

improve_gainlora/COMPARISON_PROTOCOL.md +300 -0
improve_gainlora/QUICK_START.md +121 -0
improve_gainlora/SETUP_AND_USAGE_LLAMA_SPECROUTE.md +460 -0

improve_gainlora/COMPARISON_PROTOCOL.md ADDED

	@@ -0,0 +1,300 @@

+# Comparison Protocol: SpecRoute vs GainLoRA on Llama
+This document specifies **exactly how** to compare SpecRoute results with GainLoRA baselines.
+## What We're Comparing
+| Aspect | Value |
+|--------|-------|
+| **New Method** | SpecRoute (spectral routing, parameter-free) |
+| **Baseline** | GainLoRA InfLoRA (learned routing, trainable params) |
+| **Models** | Llama-2-7B, Llama-2-13B, Llama-3-8B |
+| **Benchmark** | SuperNI (15 NLP tasks) |
+| **Metric** | Continual Learning metrics: Cl, Fgt, Fwt, Bwt |
+---
+## Step-by-Step Comparison Procedure
+### 1. Ensure both methods have completed
+```bash
+# Check SpecRoute Order 1 is done
+# (count task directories only; outputs/ also holds task_order.txt)
+ls -d logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/*/ | wc -l
+# Should output: 15 (one directory per task)
+# Check GainLoRA InfLoRA Order 1 is done
+ls -d logs_and_outputs/gen_script_superni_order1_llama_gainlora_inflora/outputs/*/ | wc -l
+# Should also output: 15
+# Same for Order 2
+ls -d logs_and_outputs/gen_script_superni_order2_llama_specroute/outputs/*/ | wc -l
+ls -d logs_and_outputs/gen_script_superni_order2_llama_gainlora_inflora/outputs/*/ | wc -l
+```
+### 2. Generate metrics for all 4 runs
+```bash
+# SpecRoute Order 1
+python score.py gen_script_superni_order1_llama_specroute gen_script_superni_order1_llama_specroute > results_specroute_order1.txt
+# SpecRoute Order 2
+python score.py gen_script_superni_order2_llama_specroute gen_script_superni_order2_llama_specroute > results_specroute_order2.txt
+# GainLoRA Order 1
+python score.py gen_script_superni_order1_llama_gainlora_inflora gen_script_superni_order1_llama_gainlora_inflora > results_baseline_order1.txt
+# GainLoRA Order 2
+python score.py gen_script_superni_order2_llama_gainlora_inflora gen_script_superni_order2_llama_gainlora_inflora > results_baseline_order2.txt
+# View all results (the metrics header carries a suffix in parentheses, so match its prefix only)
+echo "=== SpecRoute Order 1 ===" && grep -A4 "=== Continual Learning Metrics" results_specroute_order1.txt
+printf '\n=== SpecRoute Order 2 ===\n' && grep -A4 "=== Continual Learning Metrics" results_specroute_order2.txt
+printf '\n=== GainLoRA Order 1 ===\n' && grep -A4 "=== Continual Learning Metrics" results_baseline_order1.txt
+printf '\n=== GainLoRA Order 2 ===\n' && grep -A4 "=== Continual Learning Metrics" results_baseline_order2.txt
+```
+### 3. Create comparison table
+Fill in the values from above into this template:
+```markdown
+## Llama-2-7B SuperNI Continual Learning Results
+| Method | Order | Cl | Fgt | Fwt | Bwt | Avg(Cl,Fwt) |
+|--------|-------|-----|-----|-----|-----|-------------|
+| GainLoRA (InfLoRA) | Order 1 | ___ | ___ | ___ | ___ | ___ |
+| GainLoRA (InfLoRA) | Order 2 | ___ | ___ | ___ | ___ | ___ |
+| **SpecRoute** | **Order 1** | ___ | ___ | ___ | ___ | ___ |
+| **SpecRoute** | **Order 2** | ___ | ___ | ___ | ___ | ___ |
+### Average across orders:
+- GainLoRA: Cl=___, Fgt=___, Fwt=___, Bwt=___
+- SpecRoute: Cl=___, Fgt=___, Fwt=___, Bwt=___
+### Comparison summary:
+- **Cl (Current Learning)**: SpecRoute vs GainLoRA = ___ (±_%)
+- **Fgt (Forgetting)**: SpecRoute vs GainLoRA = ___ (±_%)
+- **Fwt (Forward Transfer)**: SpecRoute vs GainLoRA = ___ (±_%)
+- **Bwt (Backward Transfer)**: SpecRoute vs GainLoRA = ___ (±_%)
+```
+### 4. Example: What acceptable results look like
+**GOOD result:** SpecRoute ≈ GainLoRA (within 1-2 percentage points)
+```
+GainLoRA Order 1: Cl=0.451, Fgt=0.124, Fwt=0.424, Bwt=0.087
+SpecRoute Order 1: Cl=0.450, Fgt=0.126, Fwt=0.422, Bwt=0.089
+→ Difference (percentage points): -0.1 Cl, +0.2 Fgt, -0.2 Fwt, +0.2 Bwt
+✓ Acceptable (within noise margin, different routing but same effectiveness)
+```
+**CONCERNING result:** SpecRoute much worse (more than 3 percentage points drop in Cl)
+```
+GainLoRA Order 1: Cl=0.451
+SpecRoute Order 1: Cl=0.410
+→ Difference: -4.1 percentage points Cl, about -9.1% relative (BAD!)
+✗ Not acceptable - suggests routing issue or training instability
+```
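The acceptance check above can be sketched in Python. This is a minimal illustration and not part of the repository: `compare_metrics` and its `max_cl_drop_points` parameter are hypothetical names, and the metric values are the illustrative numbers from the examples. It reports each difference both in percentage points and relative to the baseline, since the two are easy to confuse.

```python
def compare_metrics(baseline, candidate, max_cl_drop_points=3.0):
    """Compare two metric dicts whose values are fractions in [0, 1].

    Hypothetical helper: flags the candidate as unacceptable when Cl drops
    by more than `max_cl_drop_points` percentage points.
    """
    report = {}
    for name, base in baseline.items():
        diff = candidate[name] - base
        report[name] = {
            # Absolute difference, expressed in percentage points.
            "points": round(diff * 100, 2),
            # Difference relative to the baseline value, in percent.
            "relative_pct": round(diff / base * 100, 2) if base else None,
        }
    acceptable = report["Cl"]["points"] >= -max_cl_drop_points
    return report, acceptable


# Illustrative values taken from the "GOOD result" example.
gainlora = {"Cl": 0.451, "Fgt": 0.124, "Fwt": 0.424, "Bwt": 0.087}
specroute = {"Cl": 0.450, "Fgt": 0.126, "Fwt": 0.422, "Bwt": 0.089}

report, ok = compare_metrics(gainlora, specroute)
for name, d in report.items():
    print(f"{name}: {d['points']:+.2f} points ({d['relative_pct']:+.2f}% relative)")
print("acceptable" if ok else "not acceptable")
```

Only Cl drives the pass/fail decision here because that is the threshold the examples state; other metrics are reported but not gated.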
+---
+## Robustness Check: Order Invariance
+A good continual learning method should be robust to task ordering.
+```bash
+# Compare Order 1 vs Order 2 for EACH method
+# SpecRoute robustness
+ORDER1_CL=$(grep "^Cl" results_specroute_order1.txt | cut -d':' -f2 | awk '{print $1}')
+ORDER2_CL=$(grep "^Cl" results_specroute_order2.txt | cut -d':' -f2 | awk '{print $1}')
+echo "SpecRoute Order Robustness: Order1=$ORDER1_CL, Order2=$ORDER2_CL (should be similar)"
+# GainLoRA robustness
+ORDER1_CL=$(grep "^Cl" results_baseline_order1.txt | cut -d':' -f2 | awk '{print $1}')
+ORDER2_CL=$(grep "^Cl" results_baseline_order2.txt | cut -d':' -f2 | awk '{print $1}')
+echo "GainLoRA Order Robustness: Order1=$ORDER1_CL, Order2=$ORDER2_CL (should be similar)"
+# Expected: Both methods should have similar Cl in Order 1 and Order 2
+# (within 1-2% variance due to data shuffling)
+```
+---
+## Per-Task Analysis (Advanced)
+### Extract per-task accuracies
+```bash
+# Extract the diagonal of the cross-task score matrix from SpecRoute Order 1
+# (the score on each task right after training on it)
+python -c "
+import json
+import os
+run_name = 'gen_script_superni_order1_llama_specroute'
+base_dir = 'logs_and_outputs'
+# Read task list
+with open(f'{base_dir}/{run_name}/outputs/task_order.txt') as f:
+    tasks = f.read().strip().split(',')
+print('Task order:')
+for i, t in enumerate(tasks, 1):
+    print(f'{i}. {t}')
+print()
+print('Per-task scores (diagonal = score on each task right after training on it):')
+print('Task | Diag Acc  | Forgetting from peak?')
+print('-----|-----------|---------------------')
+task_num = len(tasks)
+per_task_scores = []
+for i in range(task_num):
+    res_file = f'{base_dir}/{run_name}/outputs/{i+1}-{tasks[i]}/all_results.json'
+    if os.path.exists(res_file):
+        with open(res_file) as f:
+            result = json.load(f)
+        key = f'predict_eval_rougeL_for_{tasks[i]}'
+        score = result.get(key, 0.0)
+        per_task_scores.append(score)
+        print(f'{i+1:<5}| {score:.4f}    | -')
+    else:
+        print(f'{i+1:<5}| MISSING   | -')
+"
+```
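To see forgetting rather than only the diagonal, the full cross-task matrix is needed. The sketch below builds it from the per-task `all_results.json` files, reusing the directory layout and the `predict_eval_rougeL_for_<task>` key pattern shown above. It assumes that each task's `all_results.json` also contains scores for every previously trained task, which has not been verified against the training script; `load_score_matrix` is a hypothetical helper name.

```python
import json
import os


def load_score_matrix(outputs_dir, tasks, metric="rougeL"):
    """Return matrix[i][j]: score on task j after training on task i.

    Entries are None when the result file or the key is missing, and for
    tasks that had not been trained yet (j > i).
    """
    matrix = []
    for i, trained in enumerate(tasks):
        res_file = os.path.join(outputs_dir, f"{i + 1}-{trained}", "all_results.json")
        row = [None] * len(tasks)
        if os.path.exists(res_file):
            with open(res_file) as f:
                result = json.load(f)
            # Assumption: results after task i include all tasks seen so far.
            for j, evaluated in enumerate(tasks[: i + 1]):
                row[j] = result.get(f"predict_eval_{metric}_for_{evaluated}")
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    outputs = "logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs"
    order_file = os.path.join(outputs, "task_order.txt")
    if os.path.exists(order_file):
        with open(order_file) as f:
            task_list = f.read().strip().split(",")
        for row in load_score_matrix(outputs, task_list):
            print(row)
```

The last row of the matrix holds the final score on every task; the diagonal holds the score right after each task was learned.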
+### Compare task-by-task
+```bash
+# Create side-by-side comparison
+python -c "
+import json
+import os
+def get_task_scores(run_name):
+    base_dir = 'logs_and_outputs'
+    with open(f'{base_dir}/{run_name}/outputs/task_order.txt') as f:
+        tasks = f.read().strip().split(',')
+    scores = {}
+    for i, task in enumerate(tasks):
+        res_file = f'{base_dir}/{run_name}/outputs/{i+1}-{task}/all_results.json'
+        if os.path.exists(res_file):
+            with open(res_file) as f:
+                result = json.load(f)
+            key = f'predict_eval_rougeL_for_{task}'
+            scores[task] = result.get(key, 0.0)
+    return scores, tasks
+specroute_scores, task_order = get_task_scores('gen_script_superni_order1_llama_specroute')
+gainlora_scores, _ = get_task_scores('gen_script_superni_order1_llama_gainlora_inflora')
+# str.format avoids nesting single quotes inside an f-string (a syntax error before Python 3.12)
+print('{:<45} | {:>8} | {:>9} | {:>7}'.format('Task', 'GainLoRA', 'SpecRoute', 'Delta'))
+print('-' * 78)
+for task in task_order:
+    gl = gainlora_scores.get(task, 0.0)
+    sr = specroute_scores.get(task, 0.0)
+    delta = sr - gl
+    print(f'{task:<45} | {gl:>8.4f} | {sr:>9.4f} | {delta:>+7.4f}')
+"
+```
+---
+## Interpretation Guide
+### What each metric means:
+- **Cl (Current Learning)**: Average final accuracy across all tasks
+  - ✓ Higher is better
+  - Range: 0.0 to 1.0 (for ROUGE-L, typically 0.3-0.5 for SuperNI)
+- **Fgt (Forgetting)**: How much performance drops on earlier tasks
+  - ✓ Lower is better (ideally 0, meaning no forgetting)
+  - Range: 0.0 to 1.0
+  - Formula: average of (best_perf_on_task - final_perf_on_task) over all tasks
+- **Fwt (Forward Transfer)**: How much earlier tasks help future tasks
+  - ✓ Higher is better (positive transfer)
+  - Can be negative (negative transfer)
+  - Formula: average (final_perf_after_learning_all - perf_without_prior_tasks)
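+- **Bwt (Backward Transfer)**: How much learning new tasks affects old tasks
+  - Formula: average (final_perf - initial_perf_after_learning) over past tasks
+  - With this sign convention, negative values mean new tasks hurt past tasks, so values closer to zero (or positive) are better
+  - The example values in this document treat Bwt as a positive damage magnitude (lower is better); confirm which convention `score.py` reports before comparing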
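The Cl, Fgt, and Bwt formulas above can be written out directly. The following is a minimal sketch under stated assumptions, not the logic of `score.py`: `cl_metrics` is a hypothetical name, the input is a lower-triangular matrix where `matrix[i][j]` is the score on task `j` after training on task `i`, and Bwt uses the `final - initial` sign convention from the formula (negative when later tasks hurt earlier ones). Fwt is left out because it needs per-task baselines that are not part of this matrix.

```python
def cl_metrics(matrix):
    """Compute Cl, Fgt, and Bwt from a cross-task score matrix.

    matrix[i][j] is the score on task j after training on task i (i >= j).
    """
    n = len(matrix)
    final = matrix[n - 1]

    # Cl: average final score across all tasks.
    cl = sum(final) / n

    # Fgt: best score ever reached on a task minus its final score,
    # averaged over all tasks (the last task contributes zero).
    fgt = sum(max(matrix[i][j] for i in range(j, n)) - final[j] for j in range(n)) / n

    # Bwt: final score minus the score right after learning the task,
    # averaged over past tasks (all but the last one).
    past = range(n - 1)
    bwt = sum(final[j] - matrix[j][j] for j in past) / len(past) if n > 1 else 0.0

    return {"Cl": cl, "Fgt": fgt, "Bwt": bwt}


# Toy 3-task example with made-up scores.
toy = [
    [0.5, 0.0, 0.0],
    [0.4, 0.6, 0.0],
    [0.3, 0.5, 0.7],
]
print({name: round(value, 4) for name, value in cl_metrics(toy).items()})
```

If `score.py` reports Bwt as a positive magnitude, its value would correspond to the negation of the `Bwt` entry returned here.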
+### Expected values for SuperNI (Llama-2-7B):
+| Metric | Direction | Poor | Fair | Good | Excellent |
+|--------|-----------|------|------|------|-----------|
+| Cl | higher is better | < 0.40 | 0.40-0.45 | 0.45-0.50 | > 0.50 |
+| Fgt | lower is better | > 0.20 | 0.15-0.20 | 0.10-0.15 | < 0.10 |
+| Fwt | higher is better | < 0.35 | 0.35-0.40 | 0.40-0.45 | > 0.45 |
+| Bwt | lower is better | > 0.15 | 0.10-0.15 | 0.05-0.10 | < 0.05 |
+---
+## Final Comparison Report Template
+```markdown
+# Llama-2-7B SpecRoute vs GainLoRA (InfLoRA) - Final Report
+## Summary
+- **Model**: Llama-2-7B
+- **Benchmark**: SuperNI (15 tasks)
+- **Task Orders Tested**: Order 1, Order 2
+- **Baseline**: GainLoRA InfLoRA (ROOT implementation)
+- **New Method**: SpecRoute (spectral routing, parameter-free)
+## Results
+### Overall Performance (averaged across both orders)
+| Metric | GainLoRA | SpecRoute | Difference | Status |
+|--------|----------|-----------|------------|--------|
+| Cl     | 0.451    | 0.450     | -0.1%      | ✓ PASS |
+| Fgt    | 0.124    | 0.126     | +0.2%      | ✓ PASS |
+| Fwt    | 0.424    | 0.422     | -0.2%      | ✓ PASS |
+| Bwt    | 0.087    | 0.089     | +0.2%      | ✓ PASS |
+### Order Robustness
+| Method | Order 1 Cl | Order 2 Cl | Variance |
+|--------|-----------|-----------|----------|
+| GainLoRA | 0.451 | 0.450 | ±0.1% |
+| SpecRoute | 0.450 | 0.449 | ±0.1% |
+### Key Findings
+1. **Performance Parity**: SpecRoute achieves nearly identical accuracy to GainLoRA
+2. **Robustness**: Both methods stable across task orderings
+3. **Insights**: SpecRoute replaces learned routing with parameter-free SVD-based routing
+   - Parameter count reduced (no Trans_input, no prompt_key)
+   - Training more interpretable (spectral signatures reveal task relationships)
+   - No additional hyperparameters for routing (unlike GainLoRA's trans_hidden_dim, attn_lr, etc.)
+## Conclusion
+✓ **SpecRoute successfully ports to Llama architecture**
+✓ **Maintains parity with GainLoRA baseline**
+✓ **Ready for deployment and extension to larger models**
+```
+---
+## Quick Metric Extraction Command
+```bash
+cat results_*.txt | grep -E "(Cl |Fgt |Fwt |Bwt )" | sed 's/.*: //'
+```
+---
+**Next steps after comparison:**
+1. Record results in `results/comparison_results.md` (Table 5: Llama SpecRoute)
+2. If satisfied, commit to git: `git add -A && git commit -m "Add Llama SpecRoute results"`
+3. (Optional) Run on Llama-2-13B or Llama-3-8B for full ablation

improve_gainlora/QUICK_START.md ADDED

	@@ -0,0 +1,121 @@

+# Quick Reference: Run SpecRoute Llama in 10 Commands
+## From clean H100 server (first time setup)
+```bash
+# 1. Go to project
+cd /path/to/improve_gainlora
+# 2. Create isolated environment
+python3.10 -m venv venv_llama_specroute
+source venv_llama_specroute/bin/activate
+# 3. Install dependencies (one-time, ~5 minutes)
+pip install --upgrade pip setuptools wheel
+pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
+pip install deepspeed==0.13.1 transformers==4.36.0 datasets==2.14.7 nltk==3.8.1 rouge-score==0.1.2 tqdm==4.66.1
+# 4. Download model (first time, ~20 minutes)
+python -c "from transformers import LlamaForCausalLM, AutoTokenizer; m = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf'); t = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')"
+python -c "import nltk; nltk.download('punkt', quiet=True); nltk.download('wordnet', quiet=True)"
+# 5. Quick test (one task, ~3 minutes)
+deepspeed --include localhost:0 --master_port 49500 src/run_llama.py \
+   --do_train --do_predict --predict_with_generate \
+   --model_name_or_path meta-llama/Llama-2-7b-hf \
+   --data_dir CL_Benchmark \
+   --task_order task1572_samsum_summary,task363_sst2_polarity_classification,task1290_xsum_summarization,task181_outcome_extraction,task002_quoref_answer_generation,task1510_evalution_relation_extraction,task639_multi_woz_user_utterance_generation,task1729_personachat_generate_next,task073_commonsenseqa_answer_generation,task1590_diplomacy_text_generation,task748_glucose_reverse_cause_event_detection,task511_reddit_tifu_long_text_summarization,task591_sciq_answer_generation,task1687_sentiment140_classification,task875_emotion_classification \
+   --task_config_dir configs/gen_script_superni_order1_llama_configs/task1572_samsum_summary \
+   --output_dir logs_and_outputs/test_single_task/outputs/1-task1572_samsum_summary \
+   --training_epochs 50 --per_device_train_batch_size 2 --per_device_eval_batch_size 4 \
+   --lora_r 4 --lora_alpha 32 --threshold 0.995 --model_name specroute --num_train_epochs 50
+```
+## Subsequent runs (after environment setup)
+```bash
+# 6. Activate environment
+source venv_llama_specroute/bin/activate
+# 7. Run full Order 1 (6-10 hours)
+nohup bash gen_script_superni_order1_llama_specroute.sh 0 meta-llama/Llama-2-7b-hf > run_order1.log 2>&1 &
+tail -f run_order1.log  # Monitor
+# 8. Run full Order 2 (6-10 hours)
+nohup bash gen_script_superni_order2_llama_specroute.sh 0 meta-llama/Llama-2-7b-hf > run_order2.log 2>&1 &
+tail -f run_order2.log  # Monitor
+# 9. Calculate metrics
+python score.py gen_script_superni_order1_llama_specroute gen_script_superni_order1_llama_specroute
+python score.py gen_script_superni_order2_llama_specroute gen_script_superni_order2_llama_specroute
+# 10. Compare with baseline
+python score.py gen_script_superni_order1_llama_gainlora_inflora gen_script_superni_order1_llama_gainlora_inflora
+echo "^^^ This is the GainLoRA baseline to compare with SpecRoute results above ^^^"
+```
+## Expected Output Example (step 9)
+```
+[INFO] base_dir: logs_and_outputs
+[INFO] run_name: gen_script_superni_order1_llama_specroute
+[INFO] task_order.txt: 15 tasks
+[INFO] Building cross-task score matrix...
+=== Continual Learning Metrics (gen_script_superni_order1_llama_specroute) ===
+Cl (Current Learning):    0.451  ← Average on all tasks at end
+Fgt (Forgetting):         0.124  ← Average catastrophic forgetting
+Fwt (Forward Transfer):   0.424  ← How earlier tasks help future tasks
+Bwt (Backward Transfer):  0.087  ← How current learning damages past tasks
+=== Cross-Task Score Matrix ===
+                   task1572  task363  task1290  ... task875
+After task1 :      0.450     0.000    0.000        0.000
+After task2 :      0.438     0.462    0.000        0.000
+After task3 :      0.435     0.456    0.468        0.000
+...
+After task15:      0.412     0.440    0.451        0.456
+```
+## Comparison Example (step 10)
+```
+GainLoRA InfLoRA (Reference):
+Cl: 0.451, Fgt: 0.124, Fwt: 0.424, Bwt: 0.087
+SpecRoute (New):
+Cl: 0.450, Fgt: 0.125, Fwt: 0.422, Bwt: 0.089
+→ Performance highly similar (good! SpecRoute provides parameter-free routing
+  without sacrificing accuracy, while being more interpretable via SVD)
+```
+## Troubleshooting
+| Problem | Fix |
+|---------|-----|
+| "CUDA out of memory" | Reduce batch size: `--per_device_train_batch_size 1` |
+| "score.py not found" | Run from `improve_gainlora/` directory |
+| "task_order.txt not found" | Tasks didn't complete; check `tail -100 run_order1.log` |
+| NaN loss | Switch to fp32 if bf16 not supported by hardware |
+| "Llama-3 not supported" | Use Llama-2-7B or Llama-2-13B for now |
+## Environment deactivation
+```bash
+# When done
+deactivate
+```
+---
+**Total time breakdown:**
+- Setup (steps 1-4): 40 minutes (one-time)
+- Test (step 5): 3 minutes
+- Full Order 1 (step 7): 6-10 hours
+- Full Order 2 (step 8): 6-10 hours
+- Results (steps 9-10): 2 minutes
+- **Total: ~13-21 hours of compute time (mostly automated)**
+See [SETUP_AND_USAGE_LLAMA_SPECROUTE.md](SETUP_AND_USAGE_LLAMA_SPECROUTE.md) for detailed explanations.

improve_gainlora/SETUP_AND_USAGE_LLAMA_SPECROUTE.md ADDED

	@@ -0,0 +1,460 @@

+# Llama SpecRoute on H100: Complete Setup & Usage Guide
+## Overview
+This guide provides step-by-step instructions to:
+1. Set up an isolated Python environment on an H100 server
+2. Run Llama SpecRoute Continual Learning experiments (Order 1 & 2)
+3. Compare results with ROOT Llama GainLoRA baselines
+4. Interpret performance metrics (Cl, Fgt, Fwt, Bwt)
+**What's being tested:**
+- **Model**: Llama-2-7B, Llama-2-13B, Llama-3-8B
+- **Benchmark**: SuperNI (15 NLP tasks)
+- **Task Orders**: Order 1 (shuffled), Order 2 (shuffled differently)
+- **Baseline**: GainLoRA (ROOT implementation in this repo)
+- **New Method**: SpecRoute (parameter-free spectral routing)
+---
+## Part 1: Server Environment Setup (Isolated, No System Conflicts)
+### Step 1.1: Create isolated workspace within improve_gainlora/
+```bash
+cd /path/to/improve_gainlora
+# Create a venv in the repo (not system-wide)
+python3.10 -m venv venv_llama_specroute
+# Activate
+source venv_llama_specroute/bin/activate
+```
+**Why isolated venv?**
+- Stays within improve_gainlora/ folder
+- No conda base environment conflicts
+- Easy to reproduce (share the pinned install commands; the venv directory itself hard-codes absolute paths and is not portable across machines)
+- Can be deleted/recreated without affecting system
+### Step 1.2: Upgrade pip and install core dependencies
+```bash
+# Always upgrade pip first
+pip install --upgrade pip setuptools wheel
+# Install PyTorch with CUDA 12.1 (H100 standard)
+pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
+# Install DeepSpeed (required for multi-GPU distributed training)
+pip install deepspeed==0.13.1
+# Install HuggingFace transformers (for Llama model loading)
+pip install transformers==4.36.0
+# Install datasets and evaluation metrics
+pip install datasets==2.14.7
+pip install nltk==3.8.1
+pip install rouge-score==0.1.2
+# Install tqdm for progress bars
+pip install tqdm==4.66.1
+# Optional: cupy for GPU-accelerated operations
+pip install cupy-cuda12x==12.1.0
+```
+**Expected installation time**: 5-10 minutes
+### Step 1.3: Verify installation
+```bash
+# Check PyTorch with GPU
+python -c "import torch; print(f'CUDA Available: {torch.cuda.is_available()}'); print(f'GPU Count: {torch.cuda.device_count()}'); print(f'Current GPU: {torch.cuda.get_device_name(0)}')"
+# Check DeepSpeed
+python -c "import deepspeed; print(f'DeepSpeed version: {deepspeed.__version__}')"
+# Check transformers
+python -c "from transformers import LlamaForCausalLM; print('Transformers OK')"
+```
+**Expected output:**
+```
+CUDA Available: True
+GPU Count: 1  (or more for multi-GPU)
+Current GPU: NVIDIA H100 SXM5
+DeepSpeed version: 0.13.1
+Transformers OK
+```
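Beyond the import checks, it can help to confirm that the installed versions match the pins from Step 1.2. The sketch below uses only the standard library (`importlib.metadata`); `check_pins` and `PINNED` are hypothetical names, and the version table simply mirrors the `pip install` commands above.

```python
from importlib import metadata

# Versions pinned in Step 1.2 (distribution names as used with pip).
PINNED = {
    "torch": "2.1.0",
    "deepspeed": "0.13.1",
    "transformers": "4.36.0",
    "datasets": "2.14.7",
    "nltk": "3.8.1",
    "rouge-score": "0.1.2",
    "tqdm": "4.66.1",
}


def check_pins(pins):
    """Return a status string per package: ok, missing, or mismatch."""
    status = {}
    for package, wanted in pins.items():
        try:
            found = metadata.version(package)
        except metadata.PackageNotFoundError:
            status[package] = "missing"
            continue
        # Ignore local build tags such as "+cu121" when comparing.
        base = found.split("+")[0]
        status[package] = "ok" if base == wanted else f"mismatch (found {found})"
    return status


for package, state in check_pins(PINNED).items():
    print(f"{package:<14} {state}")
```

Run it inside the activated `venv_llama_specroute` environment; outside of it, most entries will likely show as missing or mismatched.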
+### Step 1.4: Download model weights (if not already cached)
+```bash
+# Set Hugging Face cache directory (optional, avoids default ~/.cache/)
+export HF_HOME=$(pwd)/.hf_cache
+# Pre-download Llama-2-7B (gated repo: request access on Hugging Face and run `huggingface-cli login` first)
+python -c "from transformers import LlamaForCausalLM, AutoTokenizer; \
+    model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf'); \
+    tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')"
+# Pre-download Llama-2-13B (optional, larger)
+# python -c "from transformers import LlamaForCausalLM; \
+#     model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-13b-hf')"
+# Check NLTK data
+python -c "import nltk; nltk.download('punkt', quiet=True); nltk.download('wordnet', quiet=True)"
+```
+**Expected time**: 10-30 minutes (depends on internet speed and model size)
+---
+## Part 2: Running Llama SpecRoute Experiments
+### Step 2.1: Understand the generated scripts
+Two scripts are ready-to-run:
+1. **gen_script_superni_order1_llama_specroute.sh** — 15 sequential tasks (different order than order 2)
+2. **gen_script_superni_order2_llama_specroute.sh** — 15 sequential tasks (shuffled differently for robustness)
+View script structure:
+```bash
+head -30 gen_script_superni_order1_llama_specroute.sh
+```
+Key parameters already preset:
+- **model_name=specroute**: uses spectral routing (not GainLoRA)
+- **threshold=0.995**: ESA dynamic GPM threshold
+- **lora_r=4, lora_alpha=32**: low-rank adaptation (same as ROOT)
+- **max_source_length=1024, max_target_length=50**: token limits
+- **DeepSpeed ZeRO stage 2**: distributed training that partitions optimizer states and gradients across GPUs
+- **master_port=49500**: unique port for distributed communication
+- **no data replay**: pure LoRA continual learning (zero forgetting baseline)
+### Step 2.2: Single task test run (2-5 minutes)
+Before running full 15 tasks, test a single task:
+```bash
+# Activate environment
+source venv_llama_specroute/bin/activate
+# Run only task 1 (quick test)
+deepspeed --include localhost:0 --master_port 49500 src/run_llama.py \
+   --do_train \
+   --do_predict \
+   --predict_with_generate \
+   --model_name_or_path meta-llama/Llama-2-7b-hf \
+   --data_dir CL_Benchmark \
+   --task_order task1572_samsum_summary,task363_sst2_polarity_classification,task1290_xsum_summarization,task181_outcome_extraction,task002_quoref_answer_generation,task1510_evalution_relation_extraction,task639_multi_woz_user_utterance_generation,task1729_personachat_generate_next,task073_commonsenseqa_answer_generation,task1590_diplomacy_text_generation,task748_glucose_reverse_cause_event_detection,task511_reddit_tifu_long_text_summarization,task591_sciq_answer_generation,task1687_sentiment140_classification,task875_emotion_classification \
+   --task_config_dir configs/gen_script_superni_order1_llama_configs/task1572_samsum_summary \
+   --output_dir logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/1-task1572_samsum_summary \
+   --training_epochs 50 \
+   --per_device_train_batch_size 2 \
+   --per_device_eval_batch_size 4 \
+   --lora_r 4 \
+   --lora_alpha 32 \
+   --threshold 0.995 \
+   --model_name specroute \
+   --num_train_epochs 50
+```
+**Expected output:**
+```
+[2026-03-16 14:32:10] Training task 1/1: task1572_samsum_summary
+[2026-03-16 14:32:15]   Loss: 2.345 | Epoch 1/50
+[2026-03-16 14:35:42]   Loss: 0.892 | Epoch 50/50
+[2026-03-16 14:36:01] Evaluation (ALL tasks):
+  - predict_eval_rougeL_for_task1572_samsum_summary: 0.45
+[2026-03-16 14:36:02] Saving checkpoint...
+[2026-03-16 14:36:05] DONE
+```
+**If successful**, proceed to full run.
+### Step 2.3: Run full Llama SpecRoute Order 1 (6-10 hours on H100)
+```bash
+source venv_llama_specroute/bin/activate
+# Make scripts executable
+chmod +x gen_script_superni_order1_llama_specroute.sh
+# Run (background with nohup)
+nohup bash gen_script_superni_order1_llama_specroute.sh 0 meta-llama/Llama-2-7b-hf > run_order1.log 2>&1 &
+# Parameters:
+#   $1 = GPU ID (0 for single GPU, or 0,1 for multi-GPU)
+#   $2 = Model path or HuggingFace ID
+# Monitor progress in real-time
+tail -f run_order1.log
+# Or check completion (each finished task writes an all_results.json)
+ls logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/*/all_results.json | wc -l
+# Should show: 15 (one per task)
+```
+**Estimated time**: 6-10 hours (depending on H100 speed, batch size)
+### Step 2.4: Run full Llama SpecRoute Order 2 (6-10 hours on H100)
+After Order 1 completes:
+```bash
+source venv_llama_specroute/bin/activate
+chmod +x gen_script_superni_order2_llama_specroute.sh
+nohup bash gen_script_superni_order2_llama_specroute.sh 0 meta-llama/Llama-2-7b-hf > run_order2.log 2>&1 &
+# Monitor
+tail -f run_order2.log
+```
+**Total experimental time for full comparison (1 model + 2 orders):**
+- Setup + verification: 30 mins
+- Order 1: 6-10 hours
+- Order 2: 6-10 hours
+- **Total: 12-20 hours**
+---
+## Part 3: Collect & Compare Results
+### Step 3.1: Run evaluation script
+After both orders complete:
+```bash
+source venv_llama_specroute/bin/activate
+# Compute Continual Learning metrics for Order 1
+python score.py gen_script_superni_order1_llama_specroute gen_script_superni_order1_llama_specroute
+# Example output:
+# [INFO] base_dir: logs_and_outputs
+# [INFO] run_name: gen_script_superni_order1_llama_specroute
+# === Continual Learning Metrics (Order 1) ===
+# Cl (Current Learning):    0.4523
+# Fgt (Forgetting):         0.1245
+# Fwt (Forward Transfer):   0.4234
+# Bwt (Backward Transfer):  0.0856
+# === Cross-Task Score Matrix ===
+#            T1      T2      T3  ... T15
+# Task 1:  0.450   0.000   0.000      0.000
+# Task 2:  0.438   0.462   0.000      0.000
+# ...
+```
+```bash
+# Compute for Order 2
+python score.py gen_script_superni_order2_llama_specroute gen_script_superni_order2_llama_specroute
+```
+### Step 3.2: Compare with ROOT GainLoRA Llama baseline
+Assuming ROOT GainLoRA results exist:
+```bash
+# Llama GainLoRA InfLoRA Order 1 results (reference)
+python score.py gen_script_superni_order1_llama_gainlora_inflora gen_script_superni_order1_llama_gainlora_inflora
+echo ""
+echo "=== COMPARISON: SpecRoute vs GainLoRA InfLoRA (Order 1) ==="
+echo "| Metric | GainLoRA  | SpecRoute | Delta |"
+echo "|--------|-----------|-----------|-------|"
+# Manually paste numbers from above outputs
+```
+### Step 3.3: Collect final results into comparison table
+```bash
+# Optional: Create a CSV summary
+python -c "
+import json
+import os
+def get_metrics(run_name):
+    path = f'logs_and_outputs/{run_name}/outputs/task_order.txt'
+    if not os.path.exists(path):
+        return None
+    # Parse results from score.py output
+    # (You can modify this to auto-parse JSON results)
+    pass
+# Create summary
+print('Model,Order,Method,Cl,Fgt,Fwt,Bwt')
+# Fill in from score.py outputs above
+"
+```
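One way to fill in the CSV is to parse the text that `score.py` prints. The sketch below assumes the metric lines look like the example output in Step 3.1 (for instance `Cl (Current Learning):    0.4523`); the real format may differ, so the regular expression is an assumption. `parse_metrics` and `to_csv` are hypothetical helpers, and the sample numbers are the illustrative ones from that example, not measured results.

```python
import csv
import io
import re

# Assumed line shape: "<Metric> (<description>):   <number>"
METRIC_LINE = re.compile(r"^(Cl|Fgt|Fwt|Bwt)\b[^:]*:\s*([-+]?\d*\.?\d+)")


def parse_metrics(text):
    """Extract Cl/Fgt/Fwt/Bwt values from captured score.py output."""
    metrics = {}
    for line in text.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


def to_csv(rows):
    """rows: list of (model, order, method, metrics_dict) tuples."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Model", "Order", "Method", "Cl", "Fgt", "Fwt", "Bwt"])
    for model, order, method, m in rows:
        values = [m.get(key, "") for key in ("Cl", "Fgt", "Fwt", "Bwt")]
        writer.writerow([model, order, method] + values)
    return buf.getvalue()


# Illustrative text copied from the example output in Step 3.1.
sample = """\
=== Continual Learning Metrics (Order 1) ===
Cl (Current Learning):    0.4523
Fgt (Forgetting):         0.1245
Fwt (Forward Transfer):   0.4234
Bwt (Backward Transfer):  0.0856
"""
print(to_csv([("Llama-2-7B", 1, "SpecRoute", parse_metrics(sample))]), end="")
```

In practice, redirect each `python score.py ...` call to a text file and pass the file contents to `parse_metrics`.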
+---
+## Part 4: Interpreting Results
+### Continual Learning Metrics
+| Metric | Definition | What it means |
+|--------|------------|---------------|
+| **Cl** | Average accuracy on all tasks at the final step | Overall final performance. Higher is better. |
+| **Fgt** | Average forgetting on previous tasks after learning all tasks | Catastrophic forgetting measure. Lower is better (ideally 0). |
+| **Fwt** | Average forward transfer (using tasks learned so far) | How much earlier tasks help future tasks. Higher is better. |
+| **Bwt** | Average backward transfer (final task helps previous) | How much current learning damages previous task performance. Lower is better. |
+### Expected Results (from paper baseline, Table 3)
+**Llama-2-7B GainLoRA (InfLoRA):**
+- Cl: ~0.45
+- Fgt: ~0.12
+- Fwt: ~0.42
+- Bwt: ~0.09
+**SpecRoute should achieve similar or better:**
+- Replaces learned routing with parameter-free spectral routing
+- Removes KL distillation + data replay for pure LoRA-only continual learning
+- Same LoRA GPM (task-specific neuron masks)
+### What to accept/concern:
+✅ **Good signs:**
+- Cl ≈ GainLoRA baseline (0.42-0.48)
+- Order 1 and Order 2 have similar Cl (robust to task ordering)
+- Fgt is small and stable (< 0.15)
+- Training loss decreases smoothly
+⚠️ **Warning signs:**
+- Cl much lower (< 0.40) → routing may not be converging
+- Fgt very high (> 0.20) → catastrophic forgetting problem
+- NaN in loss → numerical issue (check bf16 vs fp32)
+- Early divergence → learning rate too high or initialization issue
+---
+## Part 5: Quick Troubleshooting
+### Issue: "CUDA out of memory"
+```bash
+# Reduce batch size in script
+# Change: --per_device_train_batch_size 2
+# To:     --per_device_train_batch_size 1
+```
+### Issue: "score.py not found"
+```bash
+# Make sure you run from improve_gainlora/ directory
+cd /path/to/improve_gainlora
+python score.py ...
+```
+### Issue: "task_order.txt not found"
+```bash
+# Means tasks didn't complete. Check logs:
+tail -100 run_order1.log | grep -i error
+```
+### Issue: NaN loss
+```bash
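+# SpecRoute training uses bf16 (bfloat16).
+# If the server doesn't support bf16, remove the --bf16 flag wherever it is set
+# (launch script or DeepSpeed config) so training falls back to fp32.
+# Standard Hugging Face training arguments have no --fp32 flag; fp32 is the default
+# when neither --bf16 nor --fp16 is given. fp32 needs more GPU memory.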
+```
+### Issue: Results directory structure empty
+```bash
+# Check if training actually ran for each task:
+ls -la logs_and_outputs/gen_script_superni_order1_llama_specroute/outputs/
+# You should see: 1-task1572_samsum_summary/, 2-task363_sst2_polarity_classification/, etc.
+```
+---
+## Part 6: Advanced Usage
+### Run on Multi-GPU (if the H100 node has several GPUs)
+```bash
+# Modify GPU IDs in script or run with:
+deepspeed --include localhost:0,1,2,3 --master_port 49500 src/run_llama.py ...
+# Or in script, change:
+# deepspeed --include localhost:${1}
+# To specify multiple GPUs: 0,1 or 0,1,2,3
+```
+### Run Llama-2-13B or Llama-3-8B
+```bash
+# Simply change model path:
+bash gen_script_superni_order1_llama_specroute.sh 0 meta-llama/Llama-2-13b-hf
+# Or Llama-3 (the Hugging Face ID is meta-llama/Meta-Llama-3-8B):
+bash gen_script_superni_order1_llama_specroute.sh 0 meta-llama/Meta-Llama-3-8B
+# Note: Llama-3 support not yet implemented (will raise NotImplementedError)
+# Requires creating llama_3_specroute.py (similar steps as llama_specroute.py)
+```
+### Profile execution time per task
+```bash
+# Add timestamps to log
+for i in {1..15}; do
+    START=$(date +%s)
+    # ... run task $i ...
+    END=$(date +%s)
+    ELAPSED=$((END - START))
+    echo "Task $i: $ELAPSED seconds" >> timings.log
+done
+```
+---
+## Summary Checklist
+- [ ] Created isolated venv_llama_specroute/
+- [ ] Installed PyTorch, DeepSpeed, transformers
+- [ ] Verified CUDA availability
+- [ ] Pre-downloaded model weights (Llama-2-7B)
+- [ ] Ran single task test ✓
+- [ ] Ran full Order 1 (6-10 hours)
+- [ ] Ran full Order 2 (6-10 hours)
+- [ ] Computed metrics with score.py for both orders
+- [ ] Compared with GainLoRA baseline
+- [ ] Recorded results in comparison table
+- [ ] Interpreted performance (Cl, Fgt, Fwt, Bwt)
+---
+## Files Reference
+| File | Purpose |
+|------|---------|
+| `venv_llama_specroute/` | Isolated Python environment |
+| `src/llama_specroute.py` | Llama model with spectral routing |
+| `src/cl_trainer_specroute_llama.py` | SpecRoute trainer (GPM + ESA) |
+| `gen_script_superni_order1_llama_specroute.sh` | Task sequence 1 (15 tasks) |
+| `gen_script_superni_order2_llama_specroute.sh` | Task sequence 2 (15 tasks) |
+| `score.py` | Evaluation script (computes Cl, Fgt, etc.) |
+| `logs_and_outputs/gen_script_superni_order{1,2}_llama_specroute/outputs/` | Results per task |
+| `results/comparison_results.md` | Summary table for all methods |
+---
+## Questions?
+Check existing baselines first:
+```bash
+# ROOT GainLoRA InfLoRA results (reference)
+python score.py gen_script_superni_order1_llama_gainlora_inflora gen_script_superni_order1_llama_gainlora_inflora
+# T5 SpecRoute results (if available)
+python score.py gen_script_superni_order1_t5_specroute gen_script_superni_order1_t5_specroute
+```
+For theoretical background, see [SPECROUTE_IDEA.md](SPECROUTE_IDEA.md).