Simo76 committed on
Commit b39f929 · 1 Parent(s): ba85c35

Revise experimental results for Unified-LoRA

Updated the experimental results section to improve clarity and structure, including new quantitative and qualitative analyses of Unified-LoRA's performance under various conditions.

Files changed (1)
  1. docs/experimental_results.md +131 -93
docs/experimental_results.md CHANGED
@@ -1,144 +1,182 @@
- ## 📊 Experimental Evidence: Rank Dynamics under Disturbance

- This section summarizes the **qualitative experimental evidence** supporting the design of **Unified-LoRA**, focusing on *rank dynamics* rather than downstream accuracy.

- The goal is **not** to compete on SOTA benchmarks, but to demonstrate a **structural difference** in how model capacity is controlled during fine-tuning.

- ---

- ## Experimental Setting

- All methods were evaluated under **identical conditions**:

- - **Model:** `Qwen/Qwen3-4B-Instruct-2507`
- - **Task:** GLUE CoLA (classification, autoregressive formulation)
- - **Environment:** Tinker (black-box setting, loss not directly observable)
- - **Hardware:** Standard cloud GPU (T4-class)
- - **Training length:** ~60 steps per method

- This setup reflects realistic **API-based / enterprise fine-tuning**, where internal loss signals are not exposed.

- ---
 
- ## Methods Compared

- | Method | Category | Control Logic |
- |------|---------|---------------|
- | Standard LoRA | Baseline | Fixed rank |
- | Schedule-free / Fixed Rank | Baseline+ | Fixed rank, optimized LR |
- | AdaLoRA-like | Open-loop adaptive | Rank = function of time |
- | **Unified-LoRA (proposed)** | **Closed-loop continuous** | **Rank = function of stress** |

- ---
 
- ## Rank Dynamics: Comparative Analysis

- ### Axes
- - **X-axis:** training step (0 → ~60)
- - **Y-axis:** effective LoRA rank

- ### 1️⃣ AdaLoRA-like (budget-based)

- - Stepwise, monotonic decreasing trajectory
- - Starts at **rank = 32**
- - Slowly decays according to a predefined schedule
- - At step ~60 remains around **rank ≈ 23–24**
- - **No reaction** to shocks or dynamic changes

- **Interpretation:**
- Adaptive *offline*, but **blind to the real training state**. Rank allocation follows a schedule, not feedback.

- ---
 
- ### 2️⃣ Schedule-free / Standard LoRA

- - Flat trajectory
- - **Fixed rank = 16**
- - No dynamics, no feedback, no adaptation

- **Interpretation:**
- A stable but **capacity-blind baseline**. Learning rate optimization cannot compensate for lack of structural flexibility.

- ---
 
- ### 3️⃣ Unified-LoRA (loss-proxy + injected shocks)

- - Continuous, **non-monotonic** trajectory
- - Starts from **rank = 6** (minimum capacity)
- - Progressively grows up to **rank ≈ 31**
- - **Immediate reaction** to injected disturbances (e.g. steps ~20, ~30, ~45)
- - No unstable oscillations observed

- **Interpretation:**
- True **closed-loop control** over model capacity. Rank adapts to *observed stress*, not to a predefined schedule.

- ---
 
- ## 📌 Key Observation: Disturbance Rejection

- | Method | Shock Reaction | Stability | Recovery |
- |------|----------------|----------|----------|
- | Standard / Schedule-free | ❌ None | Passive | n/a |
- | AdaLoRA-like | ⚠️ Indirect, delayed | Partial | Limited |
- | **Unified-LoRA** | ✅ Immediate | Stable | Immediate |

- 👉 **Only Unified-LoRA exhibits disturbance rejection**, a property expected from closed-loop control systems and absent in open-loop approaches.

- ---
 
- ## Control-Theoretic Interpretation

- - **Standard / Schedule-free / AdaLoRA:** open-loop control
- - **Unified-LoRA:** closed-loop continuous control

- Formally:

- Standard / AdaLoRA: rank = f(step)
- Unified-LoRA: rank = f(stress(step, history))

- Where **stress** is a continuous, smoothed, normalized signal derived from observable training dynamics.

- ---
 
- ## Why Black-Box Matters

- Unified-LoRA operates **without direct access to the loss**.

- In Tinker-like environments, the system observes *trajectory-level signals*, not internal optimization variables.

- > “I observe the missile trajectory, not the engine, yet I can still control it.”

- This capability is critical for:
- - API-based fine-tuning
- - enterprise training pipelines
- - safety- or cost-constrained environments

- ---
 
- ## Computational Overhead

- Unified-LoRA introduces:

- - **O(1)** computation per step
- - No SVD
- - No matrix decomposition
- - Negligible overhead relative to the training step

- ---
 
- ## Takeaway

- Unified-LoRA is:
- - **not** a scheduler
- - **not** a rank budget
- - **not** a learning-rate trick

- It implements a **dynamic controller over model capacity**.

- At equal training conditions:
- - higher stability
- - better resource utilization

- Under disturbances:
- - **it is the only method that reacts correctly**
+ # Experimental Results

+ ## 1. Stress Test: Task Switch (Quantitative)
+
+ ### Setup

+ - **Model**: DistilBERT-base-uncased + NestedLoRALinear (max_rank=16)
+ - **Protocol**: MRPC × 60 steps → SST-2 × 60 steps (shock at step 60)
+ - **Seeds**: 0, 1, 2 (same seed = same batch order for baseline and unified)
+ - **Baseline**: Same architecture, rank=16 fixed, no controller
+ - **Hardware**: Google Colab, T4 GPU

+ ### Results

+ | Metric (mean over 3 seeds) | Baseline (r=16 fixed) | Unified (orbital) | Delta |
+ |------------------------|-----------------------|-------------------|----------|
+ | SST-2 Acc (new task) | 0.736 | 0.740 | +0.004 |
+ | MRPC F1 (retention) | 0.526 | 0.515 | -0.011 |
+ | Effective rank | 16.0 | 13.6 | |
+ | Rank saving | 0% | 15% | |

+ ### Per-seed detail

+ | Seed | Baseline SST-2 | Unified SST-2 | Baseline MRPC | Unified MRPC | Eff rank | Transitions |
+ |------|----------------|---------------|---------------|--------------|----------|-------------|
+ | 0 | 0.759 | 0.760 | 0.588 | 0.595 | 13.7 | 6 |
+ | 1 | 0.649 | 0.664 | 0.783 | 0.781 | 13.2 | 6 |
+ | 2 | 0.799 | 0.795 | 0.207 | 0.169 | 13.8 | 8 |
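The summary rows can be cross-checked against the per-seed table. The sketch below recomputes the three-seed means and the rank saving in plain Python. It assumes the saving is defined as 1 - (mean effective rank / max rank); that definition is not stated in the document, but it reproduces the reported 15%.

```python
from statistics import mean

MAX_RANK = 16  # fixed baseline rank from the setup

# Per-seed values copied from the table above (seeds 0, 1, 2).
per_seed = {
    "baseline_sst2": [0.759, 0.649, 0.799],
    "unified_sst2": [0.760, 0.664, 0.795],
    "baseline_mrpc": [0.588, 0.783, 0.207],
    "unified_mrpc": [0.595, 0.781, 0.169],
    "effective_rank": [13.7, 13.2, 13.8],
}

means = {name: mean(values) for name, values in per_seed.items()}

# Assumed definition: fraction of the max-rank budget left unused on average.
rank_saving = 1.0 - means["effective_rank"] / MAX_RANK

for name, value in means.items():
    print(f"{name:>15}: {value:.3f}")
print(f"{'rank_saving':>15}: {rank_saving:.1%}")
```

The recomputed means agree with the Results table to three decimals. Note that the MRPC retention mean is dominated by seed 2, where both methods score far lower than on seeds 0 and 1.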
 
+ ### Rank traces

+ **Seed 0:**
+ ```
+ [ 0] r4 r4 r4 r4 r8 r8 r16 r16 r16 r16
+ [ 10] r16 r16 r16 r16 r16 r16 r16 r16 r16 r16
+ ...
+ [ 60] <<<SHOCK r16 r16 r16 r16 r16 r16 r16 r16 r16 r16
+ [ 70] r16 r8 r8 r8 r8 r8 r8 r8 r8 r8
+ [ 80] r4 r4 r4 r4 r4 r4 r4 r4 r4 r8
+ [ 90] r8 r8 r8 r16 r16 r16 r16 r16 r16 r16
+ ```

+ **Seed 1 (cleanest trajectory):**
+ ```
+ [ 0] r4 r4 r4 r8 r8 r8 r8 r16 r16 r16
+ [ 10] r16 r16 r16 r16 r16 r16 r16 r16 r16 r16
+ ...
+ [ 60] <<<SHOCK r16 r16 r16 r16 r16 r16 r16 r16 r8 r8
+ [ 70] r8 r8 r8 r8 r4 r4 r4 r4 r4 r4
+ [ 80] r4 r4 r4 r4 r4 r4 r4 r4 r4 r4
+ [ 90] r4 r4 r8 r16 r16 r16 r16 r16 r16 r16
+ ```

+ **Seed 2:**
+ ```
+ [ 0] r4 r8 r8 r8 r8 r8 r16 r16 r16 r16
+ [ 10] r16 r16 r16 r16 r16 r16 r16 r16 r16 r16
+ ...
+ [ 60] <<<SHOCK r8 r8 r16 r16 r16 r16 r16 r16 r16 r16
+ [ 70] r16 r16 r16 r16 r8 r8 r8 r8 r8 r8
+ [ 80] r8 r8 r8 r4 r4 r4 r4 r4 r4 r4
+ [ 90] r8 r8 r8 r8 r8 r16 r16 r16 r16 r16
+ ```
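The Transitions column in the per-seed table can be recovered from these traces by counting rank changes between consecutive steps. A minimal sketch follows, assuming the elided steps (`...`) stay at r16, which is consistent with the rows on either side of the gap. `count_transitions` and `parse_row` are hypothetical helpers written for this illustration, not part of the project.

```python
def count_transitions(ranks):
    """Count steps at which the rank differs from the previous step."""
    return sum(1 for prev, cur in zip(ranks, ranks[1:]) if prev != cur)


def parse_row(row):
    """Turn a trace row such as 'r4 r4 r8' into [4, 4, 8]."""
    return [int(tok[1:]) for tok in row.split()]


# Visible rows of the seed 0 trace; the elided steps 20-59 are assumed to be r16.
seed0_rows = [
    "r4 r4 r4 r4 r8 r8 r16 r16 r16 r16",
    "r16 r16 r16 r16 r16 r16 r16 r16 r16 r16",
    " ".join(["r16"] * 40),  # assumption standing in for the '...' rows
    "r16 r16 r16 r16 r16 r16 r16 r16 r16 r16",
    "r16 r8 r8 r8 r8 r8 r8 r8 r8 r8",
    "r4 r4 r4 r4 r4 r4 r4 r4 r4 r8",
    "r8 r8 r8 r16 r16 r16 r16 r16 r16 r16",
]
seed0 = [rank for row in seed0_rows for rank in parse_row(row)]

print(count_transitions(seed0))  # 6, matching the Transitions column for seed 0
```

Under the same assumption, seed 2 yields 8 transitions because of the brief r8 dip immediately after the shock.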
 
+ ### Interpretation

+ All three seeds show the same pattern post-shock:
+ 1. The controller detects the distribution shift (loss spike after task switch)
+ 2. It descends through orbitals: r16 → r8 → r4
+ 3. It stabilizes at ground state (r4) for roughly 7-18 steps, depending on the seed
+ 4. It re-ascends when new task complexity demands capacity: r4 → r8 → r16

+ The baseline stays at r=16 for all 120 steps regardless of the shock. It has no mechanism to detect or respond to the distribution shift.
 
+ ## 2. Stable Task: Single-Task Parity (Quantitative)

+ ### Setup

+ - **Model**: DistilBERT-base-uncased + NestedLoRALinear (max_rank=16)
+ - **Task**: MRPC only, 120 steps
+ - **Seeds**: 0, 1, 2
+ - **Baseline**: Same architecture, rank=16 fixed

+ ### Results

+ | Seed | Baseline F1 | Unified F1 | Delta |
+ |------|-------------|------------|--------|
+ | 0 | 0.806 | 0.808 | +0.002 |
+ | 1 | 0.822 | 0.826 | +0.004 |
+ | 2 | 0.824 | 0.824 | +0.000 |
+ | **Mean** | **0.818 ± 0.008** | **0.820 ± 0.008** | **+0.002** |

+ On a stable task the controller stays at r=16 for nearly all steps, i.e. it does not intervene when no shock occurs. F1 is at parity with the baseline: the controller matches or slightly exceeds it on all three seeds, and the deltas (+0.000 to +0.004) are well inside the ±0.008 seed-to-seed spread.
 
+ ## 3. Rank Dynamics under Disturbance (Qualitative, Tinker)

+ ### Setup

+ - **Model**: Qwen/Qwen3-4B-Instruct-2507
+ - **Task**: GLUE CoLA (classification, autoregressive formulation)
+ - **Environment**: Tinker (black-box: loss not directly observable)
+ - **Hardware**: Cloud GPU (T4-class)
+ - **Training length**: ~60 steps per method

+ This setup reflects API-based / enterprise fine-tuning, where internal loss signals are not exposed.

+ ### Methods compared

+ | Method | Category | Control logic |
+ |----------------------|-----------------------|-------------------------|
+ | Standard LoRA | Baseline | Fixed rank |
+ | Schedule-free | Baseline+ | Fixed rank, optimized LR|
+ | AdaLoRA-like | Open-loop adaptive | Rank = f(step) |
+ | Unified-LoRA | Closed-loop continuous| Rank = f(stress) |
 
+ ### Observations

+ **AdaLoRA-like**: monotonically decreasing trajectory from rank=32 to ~24. No reaction to shocks. Adaptive offline, but blind to the real training state.

+ **Standard / Schedule-free LoRA**: flat trajectory at fixed rank. No dynamics, no adaptation.

+ **Unified-LoRA**: non-monotonic trajectory. Starts from rank=6, grows to ~31, and reacts immediately to injected disturbances at steps ~20, ~30, ~45. No unstable oscillations observed.

+ ### Disturbance rejection

+ | Method | Shock reaction | Stability | Recovery |
+ |-------------------------|----------------|-----------|-----------|
+ | Standard / Schedule-free| None | Passive | n/a |
+ | AdaLoRA-like | Indirect | Partial | Limited |
+ | Unified-LoRA | Immediate | Stable | Immediate |

+ Only Unified-LoRA exhibits disturbance rejection, a property of closed-loop control systems that is absent in open-loop approaches.
 
+ ## 4. Architecture Evolution: What Didn't Work

+ ### Separate adapters (V1-V4)

+ Four versions of the controller were tested with independent adapter matrices per rank (r=4, r=8, r=16 as separate nn.Linear pairs):

+ | Version | Mean F1 | Δ vs baseline | Saving | Problem |
+ |----------------|---------|---------------|--------|--------------------------------------|
+ | V1 Homeostatic | 0.850 | +0.002* | 62% | No baseline in same run |
+ | V2 State-Aware | 0.812 | -0.036 | 46% | Cold start on transitions |
+ | V3 State Ctrl | 0.817 | -0.031 | 47% | Stuck at r=8 on 2/3 seeds |
+ | V4 Trend-Aware | 0.821 | -0.027 | 14% | Never activated on 2/3 seeds |

+ *V1 baseline was from a different run, not directly comparable.

+ **Root cause**: switching between separate adapters means the new adapter has independent weights that never benefited from training at the previous rank. Every transition is a partial cold start.

+ **Solution**: nested orbital architecture (single A/B pair, rank via slicing). This eliminated the cold start entirely and achieved parity with baseline.
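To make "rank via slicing" concrete, here is a minimal NumPy sketch of a nested low-rank update. It is an illustration under assumptions, not the project's `NestedLoRALinear`: the class name `NestedLoRASketch`, the initialization, and the absence of a scaling factor are all hypothetical. The point it shows is that every rank reuses the leading rows of `A` and leading columns of `B`, so a rank switch never starts from untrained weights.

```python
import numpy as np


class NestedLoRASketch:
    """Frozen weight W plus a low-rank update B[:, :r] @ A[:r, :] (hypothetical)."""

    def __init__(self, d_in, d_out, max_rank=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02     # frozen base weight
        self.A = rng.standard_normal((max_rank, d_in)) * 0.01  # shared down-projection
        self.B = np.zeros((d_out, max_rank))                   # shared up-projection
        self.max_rank = max_rank

    def forward(self, x, rank):
        """Apply the layer using only the first `rank` components."""
        if not 1 <= rank <= self.max_rank:
            raise ValueError(f"rank must be in [1, {self.max_rank}]")
        A_r = self.A[:rank, :]  # a view, not a copy: r=4 is a prefix of r=8 and r=16
        B_r = self.B[:, :rank]
        return x @ self.W.T + (x @ A_r.T) @ B_r.T


layer = NestedLoRASketch(d_in=8, d_out=4)
layer.B[:] = 0.1  # stand-in for trained values, so the update is non-zero
x = np.ones((2, 8))

y4 = layer.forward(x, rank=4)
y8 = layer.forward(x, rank=8)
# The r=8 output equals the r=4 output plus the contribution of components 4..7.
extra = (x @ layer.A[4:8].T) @ layer.B[:, 4:8].T
print(np.allclose(y8, y4 + extra))
```

Because the slices are views into one parameter pair, anything learned while running at r=4 is still present after moving to r=8 or r=16, in contrast to the separate-adapter design in V1-V4.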
 
+ ### Other approaches that didn't help on clean data

+ - Adaptive rank per-layer (gradient EMA): no performance benefit
+ - Fluid dynamics metrics (shock, vorticity, swirl): too conservative
+ - Budget redistribution across layers: winner-takes-all problem
+ - Fixed-threshold hysteresis: controller either never activated or got stuck
+ - Vincolo StabilityController integration: zero shock events on stable training
 
+ ## 5. Black-Box Compatibility

+ The controller operates without access to:
+ - Gradients
+ - Internal activations
+ - Optimizer state
+ - Per-layer information

+ It observes only the loss trajectory. This makes it compatible with API-based fine-tuning platforms (Azure OpenAI, Tinker) where the training loop is exposed but model internals are not.

+ Computational overhead: O(1) per step. No SVD, no matrix decomposition.
 
+ ## Open Questions
+
+ - Scale validation on 7B+ models (Tinker experiments in progress)
+ - Minimum shock magnitude required for measurable controller benefit
+ - Adaptive LR modulation as black-box analog of rank control (for platforms where rank is fixed at creation)