Revise README for Unified-LoRA details and findings
Updated README to reflect changes in adaptive LoRA fine-tuning methodology and results.
README.md
CHANGED
@@ -1,148 +1,181 @@
 # Unified-LoRA
-**Adaptive LoRA fine-tuning with
-## Key
-|--------|---------|-----|-------------|
-| r=4 fixed | 0.410 | 0.323 | [0.62, 0.61, 0.04, 0.01, 0.78] |
-| r=16 fixed | 0.439 | 0.234 | [0.73, 0.55, 0.31, 0.06, 0.55] |
-| **FSM switching** | **0.622** | **0.174** | [0.66, 0.29, 0.70, 0.65, 0.81] |
-| Random switching | 0.275 | 0.283 | [0.13, 0.08, 0.35, 0.01, 0.79] |
 ## How it works
 ```
-φ(t) = f(loss_EMA, instability, progress)
 ```
-- Label noise, data corruption, adversarial batches
-- The controller acts as a resilience mechanism
-- Degrades less than fixed rank under stress
-- On standard GLUE tasks without noise, r=8 ≈ r=16 ≈ r=32
-- The rank choice doesn't matter, so the controller has no problem to solve
-- Tested on DistilBERT (67M), TinyLlama (1.1B), Qwen2.5-3B: same conclusion
-##
-- Per-layer gradient EMA rank controller was tested extensively
-- Multi-seed validation showed no benefit over fixed rank on clean data
-- Higher variance than fixed-rank baselines
-- Budget redistribution across layers: winner-takes-all problem
-- Adaptive gradient clipping: inconsistent
-- Vincolo StabilityController integration: zero shock events on stable training
-- FSM with LR control only (no adapter switching): loses to cosine scheduler
-- FSM stress-recovery cycle validated on Tinker with Llama-3.2-1B
-|------|-----|-----|------|
-| r=8 | 0.876 ± 0.008 | 0.913 ± 0.004 | 8 |
-| r=16 | 0.875 ± 0.004 | 0.913 ± 0.002 | 16 |
-| r=32 | 0.883 ± 0.012 | 0.918 ± 0.008 | 32 |
-[250] Mode=1 φ=0.333 (stable)
-SHOCK @ step 300
-[350] Mode=2 φ=0.827 (Mirror activated)
-RECOVERY @ step 500
-[550] Mode=1 φ=0.371 (return)
-[700] Mode=1 φ=0.333 (baseline restored)
-```
-##
-2. **Under noise, adaptive switching beats fixed rank**: the FSM provides resilience
-3. **Switching intelligence matters**: random switching is worst
-4. **Single-seed results are misleading**: always use multi-seed
-5. **The simplest baseline wins on clean data**: complexity only pays under stress
-##
-pip install transformers datasets evaluate accelerate scikit-learn peft
-#
-python benchmark.py
-#
-python fsm_noise_test.py
-```
 ## Repository structure
 ```
-notebooks/
 ```
 ## Citation
-```
 @software{unified_lora_2025,
 author = {Simona Vargiu},
-title = {Unified-LoRA: Adaptive
 year = {2025},
 url = {https://github.com/Sva76/Unified-LoRa}
 }
@@ -150,9 +183,9 @@ notebooks/ # Experiment notebooks
 ## Contact
-Simona Vargiu (Independent Researcher)
 For collaboration inquiries: simona.vargiu.malta@gmail.com
 ## License
-Apache License 2.0, see LICENSE for details.

# Unified-LoRA

**Adaptive LoRA fine-tuning with nested orbital rank control.**

A closed-loop controller that dynamically adjusts LoRA rank during training based on observed stress, using a single adapter with sliced dimensions: no cold start and no capacity loss on transitions.

## Key results

### Stress test: task switch (MRPC → SST-2, DistilBERT, 3 seeds)

| | Baseline (r=16 fixed) | Unified (orbital) | Delta |
|------------------------|-----------------------|-------------------|----------|
| SST-2 Acc (new task) | 0.736 | 0.740 | **+0.004** |
| MRPC F1 (retention) | 0.526 | 0.515 | -0.011 |
| Effective rank | 16.0 | 13.6 | |
| Rank saving | 0% | **15%** | |

Under distribution shift, the controller adapts capacity dynamically, saving 15% of rank while staying within about 0.01 of the baseline on both metrics (+0.004 SST-2 accuracy, -0.011 MRPC F1).
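
The 15% figure follows directly from the effective rank in the table. The sketch below reproduces that arithmetic. It assumes "effective rank" means the mean active rank over training steps, which the README does not state explicitly; `effective_rank`, `rank_saving`, and the toy trace are illustrative names and values, not part of the repository.

```python
from statistics import mean

def effective_rank(rank_trace):
    """Mean active rank over a sequence of per-step ranks (assumed definition)."""
    return mean(rank_trace)

def rank_saving(eff_rank, max_rank):
    """Fraction of rank capacity left unused relative to a fixed max_rank adapter."""
    return 1.0 - eff_rank / max_rank

# Values reported in the table: effective rank 13.6 against a fixed r=16.
saving = rank_saving(13.6, 16)
print(f"{saving:.0%}")  # prints 15%

# Toy trace (not taken from the experiments): 3 steps at r4, 4 at r8, 3 at r16.
toy_trace = [4] * 3 + [8] * 4 + [16] * 3
print(effective_rank(toy_trace), rank_saving(effective_rank(toy_trace), 16))
```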

### Rank trace under shock (Seed 1)

```
[  0] r4 r4 r4 r8 r8 r8 r8 r16 r16 r16          ← ground state → stress → ascend
[ 10] r16 r16 r16 r16 r16 r16 r16 r16 r16 r16   ← MRPC at full capacity
...
[ 60] <<<SHOCK r16 r16 r16 r16 r16 r16 r16 r16  ← task switch to SST-2
[ 68] r8 r8 r8 r8 r8 r8 r4 r4 r4 r4             ← controller detects shift, descends
[ 80] r4 r4 r4 r4 r4 r4 r4 r4 r4 r4             ← stable at ground state
[ 92] r8 r16 r16 r16 r16 r16 r16 r16 r16 r16    ← new task needs capacity, re-ascends
```

The controller exhibits **disturbance rejection**: detects the shock, descends to ground state, stabilizes, then re-ascends only when the new task demands capacity.

### Stable task (MRPC only, 120 steps, 3 seeds)

| | Baseline (r=16) | Unified | Delta |
|--------------|-----------------|---------|--------|
| F1 mean | 0.818 | 0.820 | +0.002 |
| σ | 0.008 | 0.008 | = |

On stable training, the controller recognizes no intervention is needed and stays at r=16. Zero degradation.

## How it works

### Architecture: nested orbitals (r4 ⊂ r8 ⊂ r16)

Unlike standard multi-adapter approaches (separate A/B matrices per rank), Unified-LoRA uses a **single pair** of matrices with rank controlled via slicing:

```python
import torch
import torch.nn as nn

max_rank, in_features, out_features, r = 16, 768, 768, 4  # example sizes
x = torch.randn(2, in_features)                           # example input batch

# One particle, multiple orbitals (init values here are illustrative)
lora_A = nn.Parameter(torch.randn(max_rank, in_features) * 0.01)  # shared
lora_B = nn.Parameter(torch.zeros(out_features, max_rank))        # shared

# Active rank = slice
h = x @ lora_A[:r, :].T      # use first r rows    -> [batch, r]
delta = h @ lora_B[:, :r].T  # use first r columns -> [batch, out_features]
```

When descending from r=16 to r=4, dimensions 0-3 retain all learned weights. Dimensions 4-15 are paused, not destroyed. When ascending back, they resume where they left off.

**This solves the cold start problem** that caused F1 degradation in earlier versions with separate adapters.
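
The forward-pass half of this behaviour can be checked with slicing alone. The sketch below uses plain Python lists instead of tensors so it runs without PyTorch; `matvec`, `nested_delta`, and the toy matrices are illustrative and not part of the repository. It shows that the output at rank r depends only on the first r dimensions, while the remaining rows and columns stay stored and take effect again at a higher rank.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def nested_delta(A, B, x, r):
    """LoRA update B[:, :r] @ (A[:r, :] @ x), using only the first r rank dimensions."""
    h = matvec(A[:r], x)                      # first r rows of A    -> r values
    return matvec([row[:r] for row in B], h)  # first r columns of B -> out_features values

# Toy shared matrices: max_rank=4, in_features=3, out_features=2.
A = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 1.0, 1.0]]
B = [[1.0, 2.0, 3.0, 4.0],
     [0.5, 0.5, 0.5, 0.5]]
x = [1.0, 2.0, 3.0]

low = nested_delta(A, B, x, r=2)   # only dimensions 0-1 are active
full = nested_delta(A, B, x, r=4)  # all four dimensions are active
print(low, full)                   # [5.0, 1.5] [38.0, 6.0]

# Changing a paused dimension (row 3 of A) leaves the r=2 output untouched.
A[3] = [9.0, 9.0, 9.0]
print(nested_delta(A, B, x, r=2) == low)  # True
```

Whether paused dimensions also stay numerically frozen during training depends on the optimizer settings, which this sketch does not model.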

### Controller: orbital trajectory with memory

The controller implements closed-loop rank control:

```
Stress  → ascend to higher orbital, push delta to stack
Stable  → pop delta from stack, symmetric return
Neutral → hold position, don't move
```
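
The three rules above can be read as a small stack machine. The following is a minimal sketch of that reading, not the repository's `OrbitalController`: the class name `TrajectorySketch`, the one-orbital step size, the ground-state start, and the choice to hold when already at the top orbital are all assumptions.

```python
class TrajectorySketch:
    """Stack-based rank trajectory over nested orbitals (illustrative only)."""

    def __init__(self, orbitals=(4, 8, 16)):
        self.orbitals = list(orbitals)
        self.index = 0    # assumed start: ground state (lowest rank)
        self.stack = []   # remembers each ascent so it can be undone

    @property
    def rank(self):
        return self.orbitals[self.index]

    def update(self, signal):
        """signal is 'stress', 'stable' or 'neutral'; returns the active rank."""
        if signal == "stress" and self.index < len(self.orbitals) - 1:
            self.index += 1
            self.stack.append(1)            # push the delta of this ascent
        elif signal == "stable" and self.stack:
            self.index -= self.stack.pop()  # symmetric return
        # 'neutral', a saturated ascent, or an empty stack: hold position
        return self.rank

traj = TrajectorySketch()
signals = ["stress", "stress", "neutral", "stable", "stable", "stable"]
print([traj.update(s) for s in signals])  # [8, 16, 16, 8, 4, 4]
```

Because every descent consumes one recorded ascent, the rank cannot drop below the level it started from, which is one way to read "symmetric return".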

The stress signal φ(t) combines loss deviation from EMA with spike detection:

```
φ(t) = |loss - EMA(loss)| + 2.0 × max(0, loss - prev_loss)
```
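
As a worked example of the formula, the sketch below evaluates φ(t) for two hand-picked steps. The function names and the EMA decay of 0.9 are assumptions; the README does not specify the decay or whether the EMA is updated before or after φ is computed.

```python
def stress_signal(loss, ema, prev_loss, spike_weight=2.0):
    """phi(t) = |loss - EMA(loss)| + spike_weight * max(0, loss - prev_loss)."""
    return abs(loss - ema) + spike_weight * max(0.0, loss - prev_loss)

def update_ema(ema, loss, decay=0.9):
    """Exponential moving average of the loss (decay=0.9 is an assumed value)."""
    return decay * ema + (1.0 - decay) * loss

# A falling loss contributes only the deviation term: |0.5 - 0.75| = 0.25.
print(stress_signal(loss=0.5, ema=0.75, prev_loss=0.625))  # 0.25

# An upward jump adds the weighted spike term: 0.5 + 2.0 * 0.5 = 1.5.
print(stress_signal(loss=1.0, ema=0.5, prev_loss=0.5))     # 1.5
```

The spike term is one-sided, so a sudden improvement in loss raises φ only through the deviation term.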

Thresholds are **adaptive** (μ ± kσ of recent φ history), so the controller calibrates itself to the loss scale of a given model and task instead of relying on hand-tuned absolute thresholds.
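
One way to realize μ ± kσ thresholds is a sliding window over recent φ values. The sketch below is an assumption-laden illustration: the class name, the window size, k=1.0, the warm-up length, and the mapping (above μ + kσ means stress, below μ − kσ means stable, otherwise neutral) are not taken from the repository.

```python
from collections import deque
from statistics import mean, pstdev

class AdaptiveThresholds:
    """Classify phi against mu +/- k*sigma of a sliding window (illustrative sketch)."""

    def __init__(self, window=20, k=1.0, warmup=5):
        self.history = deque(maxlen=window)  # recent phi values
        self.k = k
        self.warmup = warmup                 # steps to observe before classifying

    def classify(self, phi):
        """Return 'stress', 'stable' or 'neutral' for this phi, then record it."""
        signal = "neutral"
        if len(self.history) >= self.warmup:
            mu, sigma = mean(self.history), pstdev(self.history)
            if phi > mu + self.k * sigma:
                signal = "stress"
            elif phi < mu - self.k * sigma:
                signal = "stable"
        self.history.append(phi)
        return signal

phis = [0.10, 0.12, 0.11, 0.09, 0.10, 0.50, 0.01]

thresholds = AdaptiveThresholds()
print([thresholds.classify(p) for p in phis])

# The same sequence at 100x the scale gets the same labels,
# because the thresholds are derived from the data itself.
scaled = AdaptiveThresholds()
print([scaled.classify(100 * p) for p in phis])
```

Both calls print five `neutral` labels during warm-up, then `stress` for the spike and `stable` for the drop that follows it.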

This is not a scheduler, not a rank budget, not a learning rate trick. It is a **trajectory controller** over model capacity.

## Quick start

```python
import torch
from controller import setup_unified_lora, set_rank

# One-call setup (`model` and `train_loader` are assumed to be defined already)
model, ctrl = setup_unified_lora(model, max_rank=16)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Training loop
for step, batch in enumerate(train_loader):
    loss = model(**batch).loss

    # Feed the scalar loss to the controller and apply the rank it returns
    new_rank = ctrl.step(loss.item())
    set_rank(model, new_rank)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

## What works and what doesn't

### Works: distribution shift / noisy training

Under task switch, label noise, or data corruption, the controller adapts rank dynamically. Demonstrated on:

- **Task switch** (MRPC → SST-2): parity with the fixed r=16 baseline plus a 15% rank saving, disturbance rejection confirmed
- **Label noise** (50%, DistilBERT/MRPC, 5 seeds): FSM switching F1=0.622 vs best fixed rank F1=0.439

### Works: black-box training (API / enterprise)

The controller observes only the loss trajectory; it needs no access to gradients, internal activations, or optimizer state. This makes it compatible with API-based fine-tuning endpoints where internal signals are not exposed.

### Doesn't help: clean stable training

On standard GLUE tasks without perturbation, rank choice doesn't matter (r=8 ≈ r=16 ≈ r=32 from 67M to 3B parameters). The controller correctly recognizes this and stays at max rank: no harm, but no benefit.

## Experimental evolution

This project tested many approaches. In the interest of scientific honesty:

### Tested and didn't help (clean data)

- **Separate adapters per rank** (V1-V4): cold start on transitions caused 3-6 point F1 loss vs baseline. Each rank switch activated an adapter with independent weights that hadn't benefited from previous training. Solved by nested architecture.
- **Adaptive rank per-layer** (gradient EMA): no performance benefit over fixed rank
- **Fluid dynamics metrics** (shock, vorticity, swirl): too conservative as stress signals
- **Trend-aware hysteresis** with fixed thresholds: controller either never activated or got stuck at intermediate rank
- **Budget redistribution** across layers: winner-takes-all problem

### What works

- **Nested orbital architecture**: zero cold start, parity with baseline guaranteed
- **Trajectory controller with orbital memory**: disturbance rejection under task switch
- **Adaptive thresholds** (μ ± kσ): auto-calibrates across models and tasks
- **FSM adapter switching under noise**: measurably better performance and lower variance

## Computational overhead

The controller adds O(1) computation per step: one EMA update, one threshold comparison, one stack operation. No SVD, no matrix decomposition. Negligible relative to the training step.

## Control-theoretic framing

| Method | Control type | Rank dynamics |
|-------------------------|-----------------|-----------------------|
| Standard LoRA | None | rank = constant |
| AdaLoRA | Open-loop | rank = f(step) |
| **Unified-LoRA** | **Closed-loop** | rank = f(stress(t)) |

Unified-LoRA introduces orbit-aware rank transitions: each capacity increase is tracked and reversed only under confirmed stability, preventing premature compression and oscillatory collapse.

## Repository structure

```
controller.py                     # NestedLoRALinear + OrbitalController
experiments/
  stress_test_task_switch.py      # MRPC → SST-2 stress test (key result)
  stable_task_test.py             # Single-task parity test
docs/
  experimental_results.md         # Detailed results and rank traces
  architecture.md                 # Nested orbital design
notebooks/                        # Experiment notebooks
```

## Open questions

- Does nested orbital control scale to 7B+ models? (Tinker validation in progress)
- What is the minimum shock magnitude that triggers measurable benefit?
- Does adaptive LR control (black-box analog) show the same pattern on API platforms?

## Citation

```bibtex
@software{unified_lora_2025,
  author = {Simona Vargiu},
  title = {Unified-LoRA: Adaptive Fine-Tuning with Nested Orbital Rank Control},
  year = {2025},
  url = {https://github.com/Sva76/Unified-LoRa}
}
```

## Contact

**Simona Vargiu** (Independent Researcher)
For collaboration inquiries: simona.vargiu.malta@gmail.com

## License

Apache License 2.0, see [LICENSE](LICENSE) for details.