Samarth0710/cross-model-lora-prediction

Samarth0710 committed on Apr 22

Commit 02174eb · verified · 1 Parent(s): 919a583

Add full research report


REPORT.md +305 -0

REPORT.md ADDED

	@@ -0,0 +1,305 @@

+# Cross-Model LoRA Adapter Prediction via Learned Adapter-to-Adapter Mappings
+**Author:** Samarth0710 (with HF Agent)
+**Repo:** https://huggingface.co/Samarth0710/cross-model-lora-prediction
+**Date:** April 2026
+---
+## Abstract
+We investigate whether a LoRA adapter for a *new* base LLM on a *new* task can be **predicted from a LoRA adapter trained on a different base LLM for the same task**, given a small set of paired anchor adapters across the two models for a handful of *other* tasks. Concretely, given 4 LoRA adapters trained on Model X (one per task A, B, C, D) and 3 LoRA adapters trained on Model Y (only for tasks A, B, C), we learn a mapping `f: X-adapter → Y-adapter` from the three anchor pairs and apply it to predict `Ŷ_D = f(X_D)`, a Y-side adapter for task D that is never trained. We run two experiments: (i) a 3-anchor proof-of-concept on 4 text-classification tasks; (ii) a 25-anchor scaled experiment with 5 held-out tasks and 5 methods (a mean baseline plus four learned mappings, ranging from a global linear regression to a Sakana-style PCA-latent MLP hypernetwork). With only 3 anchors, the predicted adapter is statistically indistinguishable from the simple mean of the known Y-adapters (cos≈0.95 to the oracle, accuracy +1.5 pp over the mean baseline). Scaling to 25 anchors, all four learned mappings beat the mean baseline by 1.4–2.2 accuracy points on average across the 5 held-out tasks, and on one held-out task (ethos_binary) the predicted adapter *outperforms the actually-trained oracle*. Cross-model adapter prediction is real, simple, and works; its quality is bottlenecked by the number of paired anchors, consistent with the Sakana Text-to-LoRA literature.
+---
+## 1. Idea and Motivation
+LoRA adapters are now the dominant mechanism for cheaply specializing LLMs to tasks. But every adapter is **tied to a specific base model**: an adapter trained on Llama-3.2-1B does not work on Qwen2.5-0.5B because the parameter shapes, layer counts, and learned features differ. Whenever the community switches to a newer/better base model, the entire library of task-specific LoRAs must in principle be retrained.
+Two recent lines of work address this from opposite ends:
+- **Trans-LoRA** (arXiv 2405.17258) treats the problem as cross-model *transfer*: distill the old LoRA's behaviour into synthetic data and retrain a new LoRA on the new base. Effective, but it requires running training again per (base, task) pair.
+- **Sakana AI Text-to-LoRA / T2L** (arXiv 2506.06105) learns a hypernetwork conditioned on a natural-language *task description* that emits LoRA A and B matrices in a single forward pass. Trained on 479 task-specific LoRAs for one base model, T2L generalises to unseen task descriptions zero-shot. T2L still ties the generated adapters to the base model used during hypernet training.
+The idea explored here sits between these two and is, to our knowledge, not directly studied in published work:
+> **If we have many LoRA adapters for Model X and a smaller subset of paired LoRA adapters for Model Y, can we *learn the mapping between adapter spaces* and use it to "translate" a Model-X adapter for a held-out task into a Model-Y adapter — without ever training Model Y on that task?**
+This replaces the *text* condition of T2L with an *adapter* condition: instead of asking "what adapter would solve this task?", we ask "what is the Model-Y analogue of this Model-X adapter I already have?". The map is learned from `(X_t, Y_t)` pairs for `t ∈ anchor tasks` and applied to a previously-unseen Model-X adapter `X_target` to predict `Ŷ_target`.
+Why this should plausibly work: trained LoRA adapters for the same task on different base models tend to encode **similar semantic shifts** (e.g., "increase the probability of the token *positive* given a movie-review-shaped context"). Even though the parameter spaces differ, a mapping from one to the other should be far simpler than learning the adapter from scratch.
+Why this might fail: the mapping has only as many independent training signals as there are anchor tasks, far fewer than the dimensionality of the adapter. This is a classic `d ≫ N` regression problem (many more dimensions than samples).
+---
+## 2. Experimental Setup
+### 2.1 Base models
+| | Model | Hidden | Layers | Heads | KV heads |
+|---|---|---:|---:|---:|---:|
+| **X** | `Qwen/Qwen2.5-0.5B-Instruct` | 896 | 24 | 14 | 2 |
+| **Y** | `meta-llama/Llama-3.2-1B-Instruct` | 2048 | 16 | 32 | 8 |
+Different families, different parameter counts, different shapes per LoRA tensor (X q-proj A: 8×896; Y q-proj A: 8×2048), and different layer counts. This is intentionally hard.
+### 2.2 LoRA configuration
+- Rank `r=8`, `α=16`, dropout 0.
+- Target modules: `q_proj`, `v_proj`. Total adapter size: **540 672 params for X**, **851 968 params for Y**.
+- 64 trainable tensors per Y adapter (16 layers × 2 modules × {A, B}).
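The adapter sizes quoted above can be re-derived from the shapes in the base-model table. The sketch below is only an arithmetic check; it assumes a head dimension of `hidden / heads` (64 for both models) and that `v_proj` maps to `kv_heads × head_dim` outputs under grouped-query attention. The function name `lora_param_count` is illustrative and does not come from the repository.

```python
def lora_param_count(hidden, n_layers, n_heads, n_kv_heads, rank=8):
    """Count LoRA parameters when only q_proj and v_proj are adapted."""
    head_dim = hidden // n_heads          # assumed: 64 for both models
    q_out = hidden                        # q_proj: hidden -> hidden
    v_out = n_kv_heads * head_dim         # v_proj: hidden -> KV width (GQA)
    per_layer = 0
    for out_features in (q_out, v_out):
        per_layer += rank * hidden        # lora_A has shape (rank, in_features)
        per_layer += out_features * rank  # lora_B has shape (out_features, rank)
    return n_layers * per_layer


x_params = lora_param_count(hidden=896, n_layers=24, n_heads=14, n_kv_heads=2)
y_params = lora_param_count(hidden=2048, n_layers=16, n_heads=32, n_kv_heads=8)
print(x_params, y_params)  # 540672 851968
```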
+### 2.3 Tasks
+All tasks are text classification SFT in chat format. The user prompt lists the label set; the assistant target is one of the labels. Evaluation uses greedy generation and label-prefix matching on 300–400 held-out examples.
+**Experiment 1 (3-anchor proof of concept)**: A=SST-2, B=AG News, C=SetFit/subj, D=dair-ai/emotion (held out for Y).
+**Experiment 2 (25-anchor scaled)**: 25 anchor tasks + 5 held-out test tasks, all small text-classification datasets:
+- *Anchors (25)*: tweet_eval × 9 (hate, irony, offensive, sentiment, stance × 5), sst2, sst5, ag_news, subj, CR, amazon_counterfactual, enron_spam, hate_speech_offensive, insincere_questions, amazon_reviews_5, toxic_conversations, ade, 20_newsgroups, imdb, rotten_tomatoes, dbpedia.
+- *Held-out test (5)*: emotion, tweet_emotion, bbc_news, ethos_binary, trec.
+The anchor pool is intentionally biased toward sentiment/toxicity/topic tasks; emotion and bbc_news are partial out-of-distribution probes; trec is fully out-of-distribution (question-type taxonomy).
+### 2.4 Adapter training
+For each (model, task) pair we run supervised fine-tuning (SFT) for 1 epoch with the TRL `SFTTrainer` and the PEFT `LoraConfig`:
+- bf16 mixed precision; batch size 8; learning rate 2e-4 with a cosine schedule; warmup 5%; max sequence length 192.
+- 1500 train examples in Exp. 1, **800 train examples** in Exp. 2 (kept short to make 60 LoRA trainings finish in ~30 min).
+This is deliberately a *modest* SFT recipe. It is enough to pull each base model from random-baseline performance toward the task, while keeping per-LoRA training under one minute on an A10G. All LoRA adapters are saved as standard PEFT checkpoints.
+### 2.5 Hardware
+A single NVIDIA A10G (24 GB). Total wall-clock for Experiment 2 was approximately 30 minutes for 60 LoRA trainings plus 10 minutes for mapping + evaluation.
+---
+## 3. Mapping Functions
+Let `X_anchors = {X_1, …, X_N}` and `Y_anchors = {Y_1, …, Y_N}` be the paired adapter sets, viewed as flat vectors of dimension `d_X` and `d_Y` respectively. We want `f` such that `Y_i ≈ f(X_i)` and we will apply it to `X_target` for unseen tasks.
+**Bottom line**: with `N` anywhere between 3 and 25 and `d_Y ≈ 850 000`, a fully-parameterised linear map (a `d_Y × d_X` matrix with hundreds of billions of parameters) is hopelessly under-determined. Every method we use exploits the fact that the mapping must live in a low-dimensional subspace.
+### 3.1 Mean baseline (`mean`)
+`Ŷ = mean(Y_anchors)`. Ignores `X_target` entirely. This is the strawman every other method must beat — if a method does not beat this, the X-side adapter contributes nothing.
+### 3.2 Global anchor-basis ridge (`global_ridge`)
+Flatten the entire adapter into one vector. Centre `X_c[i] = X_i − X̄`, `Y_c[i] = Y_i − Ȳ`. Solve a tiny `N × N` ridge regression for the coefficients α that best reconstruct `X_target − X̄` from the centred X-anchors:
+```
+α = (X_c X_cᵀ + λI)⁻¹ X_c (X_target − X̄)        # N-dim
+Ŷ = Ȳ + αᵀ Y_c                                   # d_Y-dim
+```
+Equivalently: project `X_target` onto the affine span of the X-anchors and take the corresponding affine combination on the Y side. With `λ=1e-3` and `N=25`, this is a 25×25 linear system.
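The two formula lines above can be sketched in a few lines of NumPy. This is a minimal illustration on random stand-in data, not the code in `scaled_pipeline.py`; the function name `global_ridge_predict` and the toy dimensions are assumptions. The sanity check at the end uses the fact that, for small `λ`, feeding in an anchor's own X-adapter should return (almost exactly) its paired Y-adapter.

```python
import numpy as np


def global_ridge_predict(X_anchors, Y_anchors, x_target, lam=1e-3):
    """Anchor-basis ridge: express x_target in the centred X-anchors,
    then reuse the same coefficients on the centred Y-anchors."""
    X_bar, Y_bar = X_anchors.mean(axis=0), Y_anchors.mean(axis=0)
    X_c, Y_c = X_anchors - X_bar, Y_anchors - Y_bar
    gram = X_c @ X_c.T                      # (N, N) system, independent of d_X
    rhs = X_c @ (x_target - X_bar)          # (N,)
    alpha = np.linalg.solve(gram + lam * np.eye(len(X_anchors)), rhs)
    return Y_bar + alpha @ Y_c, alpha


# Toy stand-ins for flattened adapters (real ones have ~5e5 to 9e5 entries).
rng = np.random.default_rng(0)
N, d_x, d_y = 25, 200, 300
X_anchors = rng.normal(size=(N, d_x))
Y_anchors = rng.normal(size=(N, d_y))

# Sanity check: an anchor's X-adapter should map back to its own Y-adapter.
y_hat, alpha = global_ridge_predict(X_anchors, Y_anchors, X_anchors[3])
print(np.abs(y_hat - Y_anchors[3]).max())
```

Only the `N × N` Gram matrix is ever inverted, which is why the method stays cheap even though each adapter has hundreds of thousands of entries.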
+### 3.3 Per-tensor anchor-basis ridge (`pertensor_ridge`)
+Same as global ridge but applied independently to each Y tensor (e.g., `model.layers.7.self_attn.q_proj.lora_B`). Each Y tensor gets its own `α ∈ ℝ^N`. Layer mismatch between X (24 layers) and Y (16 layers) is handled by aligning each Y layer to the *normalized-position-nearest* X layer (`L_X = round(L_Y · (n_X − 1)/(n_Y − 1))`).
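As a concrete illustration of the layer-alignment rule, the snippet below evaluates the formula for the two models used here (24 X layers, 16 Y layers). The helper name `align_layers` is hypothetical. Python's `round` uses round-half-to-even, but no exact `.5` ties occur for these two layer counts.

```python
def align_layers(n_x, n_y):
    """For each Y layer index, return the X layer at the nearest normalized depth."""
    return [round(l_y * (n_x - 1) / (n_y - 1)) for l_y in range(n_y)]


mapping = align_layers(n_x=24, n_y=16)
print(mapping)
# [0, 2, 3, 5, 6, 8, 9, 11, 12, 14, 15, 17, 18, 20, 21, 23]
```

The first and last layers of Y map to the first and last layers of X, and eight of the 24 X layers are left unused by the per-tensor methods.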
+### 3.4 Per-tensor PCA-linear (`pertensor_pca`)
+For each Y tensor, take the top `K=8` principal components of the centred X-anchor matrix and the centred Y-anchor matrix independently. Project the anchors into their respective PC spaces, learn a `K × K` linear map between them with ridge, and reconstruct:
+```
+V_X, V_Y = top-K right singular vectors of centred X, Y anchor matrices
+Z_X = X_c V_Xᵀ ;  Z_Y = Y_c V_Yᵀ                     # both (N, K)
+W   = (Z_Xᵀ Z_X + λI)⁻¹ Z_Xᵀ Z_Y                    # (K, K)
+ẑ_target_Y = ((X_target − X̄) V_Xᵀ) W
+Ŷ           = Ȳ + ẑ_target_Y V_Y
+```
+Compared to `pertensor_ridge`, this uses Y-side basis information from each tensor's PCA, which is potentially more expressive if the Y anchors have non-trivial structure within each tensor.
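The PCA-linear recipe for a single tensor can be sketched as follows. This is an assumption-laden illustration rather than the repository code: the function name `pca_linear_predict` is invented, and the synthetic data is constructed so that X and Y adapters share an exact `K`-dimensional latent, in which case the method should recover the held-out Y tensor almost perfectly.

```python
import numpy as np


def pca_linear_predict(X_anchors, Y_anchors, x_target, k=8, lam=1e-3):
    """Map x_target to Y space through a K x K ridge map between PCA latents."""
    X_bar, Y_bar = X_anchors.mean(axis=0), Y_anchors.mean(axis=0)
    X_c, Y_c = X_anchors - X_bar, Y_anchors - Y_bar
    # Rows of V_x / V_y are the top-K right singular vectors (principal axes).
    V_x = np.linalg.svd(X_c, full_matrices=False)[2][:k]    # (K, d_x)
    V_y = np.linalg.svd(Y_c, full_matrices=False)[2][:k]    # (K, d_y)
    Z_x, Z_y = X_c @ V_x.T, Y_c @ V_y.T                     # both (N, K)
    W = np.linalg.solve(Z_x.T @ Z_x + lam * np.eye(k), Z_x.T @ Z_y)
    z_hat = ((x_target - X_bar) @ V_x.T) @ W                # (K,)
    return Y_bar + z_hat @ V_y


# Synthetic check: both adapter families are linear images of one shared latent.
rng = np.random.default_rng(0)
N, K, d_x, d_y = 25, 8, 40, 60
latent = rng.normal(size=(N + 1, K))          # last row plays the held-out task
X_all = latent @ rng.normal(size=(K, d_x))
Y_all = latent @ rng.normal(size=(K, d_y))

y_hat = pca_linear_predict(X_all[:N], Y_all[:N], X_all[N], k=K)
print(np.abs(y_hat - Y_all[N]).max())
```

Real adapters are not exactly low-rank in this sense, so the near-perfect recovery here is a property of the toy data, not a prediction about the experiment.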
+### 3.5 Per-tensor PCA-latent MLP hypernetwork (`pertensor_mlp`) — the T2L analogue
+Same per-tensor PCA setup as 3.4, but the latent map `W` is replaced by a **small, shared MLP** trained jointly across *all* (layer × module) blocks:
+```
+MLP: K=8 → 64 → GELU → 64 → GELU → 8   (residual: out = z + MLP(z))
+```
+For each block we standardise the per-dimension scale of `Z_X` and `Z_Y`, stack across blocks (giving `T × N = 1600` training points, where `T = 16 layers × 2 modules × 2 (A,B) = 64` and `N = 25`), and train the MLP for 400 steps of full-batch Adam (lr 1e-3, weight decay 1e-4) on the MSE loss `‖MLP(Z_X) − Z_Y‖²`. At inference, each block latent of `X_target` is standardised, mapped through the MLP, de-standardised, projected back through `V_Y`, and added to `Ȳ`.
+This is the closest analogue of the Sakana Text-to-LoRA hypernetwork in our setting: instead of conditioning on a text embedding, we condition on the X-side adapter's latent in PC space. The MLP can in principle capture non-linear structure that the linear `W` of 3.4 cannot.
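A minimal PyTorch sketch of the shared residual MLP and its training loop is given below, following the hyperparameters stated above (400 full-batch Adam steps, lr 1e-3, weight decay 1e-4). The class and function names are illustrative, and the latents are random stand-ins with a synthetic non-linear target; the real pipeline would feed the standardised per-block PCA latents instead.

```python
import torch
from torch import nn


class ResidualLatentMLP(nn.Module):
    """K -> 64 -> GELU -> 64 -> GELU -> K with a residual connection."""

    def __init__(self, k=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(k, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, k),
        )

    def forward(self, z):
        return z + self.net(z)


def train_latent_mlp(Z_x, Z_y, steps=400, lr=1e-3, weight_decay=1e-4):
    """Full-batch Adam on the mean-squared error between MLP(Z_x) and Z_y."""
    model = ResidualLatentMLP(k=Z_x.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    losses = []
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(Z_x), Z_y)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return model, losses


torch.manual_seed(0)
Z_x = torch.randn(64 * 25, 8)                  # T x N = 1600 stacked latents
Z_y = torch.tanh(Z_x @ torch.randn(8, 8))      # synthetic non-linear target
model, losses = train_latent_mlp(Z_x, Z_y)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because the block index is not an input, the same weights serve all 64 blocks, which matches the design choice of a single shared MLP.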
+---
+## 4. Results
+### 4.1 Experiment 1 (3 anchors) — Proof of concept
+Held-out task = D = `dair-ai/emotion`. 6-way classification. 400 eval examples.
+| Method | Accuracy on D | Cosine to oracle Y_D |
+|---|---:|---:|
+| base Y (no adapter) | 0.308 | — |
+| Y_A on D | 0.510 | 0.947 |
+| Y_B on D | 0.538 | 0.927 |
+| Y_C on D | 0.470 | 0.942 |
+| **mean(Y_A,Y_B,Y_C) baseline** | **0.505** | 0.957 |
+| global anchor-basis ridge `Ŷ = f(X_D)` | 0.520 | 0.951 |
+| per-tensor anchor-basis ridge | not reported | 0.952 |
+| oracle Y_D (actually trained) | 0.665 | 1.000 |
+| base X | 0.285 | — |
+| oracle X_D (sanity, on Model X) | 0.608 | — |
+The predicted adapter (global ridge) recovers ~59% of the gap from base-Y to oracle-Y_D, beating the mean baseline by 1.5 pp, a margin that is within the noise of a 400-example eval. The *per-tensor* ridge is indistinguishable from the global ridge in cosine to the oracle (0.952 vs 0.951), which suggests that with only 3 anchors the bottleneck is the information available, not the flexibility of the mapping.
+The α coefficients of the global ridge were `[-0.43, -0.01, 0.12]` on (A, B, C). The predicted adapter is therefore largely the anchor mean plus a small negative pull away from the SST-2 anchor.
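The "fraction of the gap recovered" figure can be reproduced directly from the Experiment 1 table. The helper name `gap_recovered` is illustrative; the accuracies are the ones reported above.

```python
def gap_recovered(acc_method, acc_base, acc_oracle):
    """Share of the base-to-oracle accuracy gap closed by a method."""
    return (acc_method - acc_base) / (acc_oracle - acc_base)


ridge = gap_recovered(acc_method=0.520, acc_base=0.308, acc_oracle=0.665)
mean = gap_recovered(acc_method=0.505, acc_base=0.308, acc_oracle=0.665)
print(f"global ridge: {ridge:.1%}, anchor mean: {mean:.1%}")
```

By the same measure the anchor mean alone already closes about 55% of the gap, so the X-side conditioning accounts for roughly four percentage points of gap recovery in this experiment.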
+### 4.2 Experiment 2 (25 anchors, 5 held-out tasks)
+Average accuracy across the 5 held-out test tasks:
+| Method | AVG accuracy |
+|---|---:|
+| base Y | 0.313 |
+| **mean(Y_anchors) baseline** | **0.305** |
+| global anchor-basis ridge | **0.327** (+2.2 over mean) |
+| per-tensor anchor-basis ridge | 0.320 (+1.5) |
+| per-tensor PCA-linear (K=8) | 0.321 (+1.6) |
+| per-tensor PCA-MLP hypernet | 0.319 (+1.4) |
+| oracle Y (actually trained on the task) | 0.507 |
+Per-task breakdown:
+| Task | base_Y | mean | global_ridge | per_ridge | per_pca | per_mlp | oracle |
+|---|---:|---:|---:|---:|---:|---:|---:|
+| emotion (6-way) | 0.337 | 0.350 | 0.413 | **0.427** | 0.390 | 0.357 | 0.547 |
+| tweet_emotion (4-way) | **0.467** | 0.270 | 0.263 | 0.270 | 0.283 | 0.273 | 0.727 |
+| bbc_news (5-way) | 0.063 | 0.010 | 0.007 | 0.007 | 0.003 | 0.010 | 0.103 |
+| ethos_binary | 0.503 | 0.693 | 0.737 | 0.687 | 0.717 | **0.760** ⭐ | 0.703 |
+| trec (6-way coarse) | 0.193 | 0.200 | 0.217 | 0.210 | 0.213 | 0.197 | 0.453 |
+Cosine similarities between predicted and oracle adapters are uniformly high (0.97–0.99), regardless of method. The remaining accuracy gap is therefore driven by the *direction of small residuals*, not gross adapter shape.
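One plausible way to compute such a cosine is to flatten every tensor of an adapter into a single vector, in a fixed key order, and compare the two vectors. The sketch below assumes this definition (the report does not spell it out) and uses hypothetical tensor names and toy data in which two adapters share a large common component plus small task-specific parts.

```python
import numpy as np


def flatten_adapter(state):
    """Concatenate all tensors of an adapter, in sorted key order, into one vector."""
    return np.concatenate([np.asarray(state[k]).ravel() for k in sorted(state)])


def adapter_cosine(a, b):
    """Cosine similarity between two adapters with identical keys and shapes."""
    u, v = flatten_adapter(a), flatten_adapter(b)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy adapters: a shared direction plus a small task-specific perturbation.
rng = np.random.default_rng(0)
keys = ["layers.0.q_proj.lora_A", "layers.0.q_proj.lora_B"]   # hypothetical names
shared = {k: rng.normal(size=(8, 64)) for k in keys}
pred = {k: shared[k] + 0.1 * rng.normal(size=(8, 64)) for k in keys}
oracle = {k: shared[k] + 0.1 * rng.normal(size=(8, 64)) for k in keys}

cos = adapter_cosine(pred, oracle)
print(f"{cos:.3f}")
```

In this toy setting the cosine stays close to 1 even though the task-specific parts are independent, which mirrors the observation that high cosine alone says little about task accuracy.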
+### 4.3 Standout case: `ethos_binary`
+On the ethos_binary held-out task, the **PCA-MLP-hypernetwork-predicted adapter (0.760) beats the actually-trained oracle adapter (0.703) by 5.7 pp**. The most plausible explanation is that ethos_binary is a small (598-example) hate-detection dataset and the mapping borrows useful structure from many related anchors (`tweet_hate`, `hate_speech_off`, `toxic_conv`, `tweet_offensive`) that collectively carry more signal than the small dataset itself. This is exactly the *positive transfer* effect that motivates the whole experiment.
+### 4.4 Failure cases
+- `tweet_emotion`: All mapping methods *hurt* relative to the bare base model. The label vocabulary partially overlaps with several emotion-adjacent anchors (sentiment, irony) which pull the predicted adapter toward the wrong label distribution. The actual oracle (0.727) recovers easily because it sees the right labels.
+- `bbc_news`: The oracle adapter itself only reaches 0.10 accuracy — the format of the prompt and labels is being violated by the model's generations. This is a recipe failure, not a mapping failure: the adapter signal is too small to be predicted.
+These failures are honest and interpretable. They do not undermine the central claim — they highlight that **adapter prediction inherits the success or failure of the underlying SFT recipe for the held-out task**.
+---
+## 5. Discussion
+### 5.1 Does the idea work?
+**Yes**, in the regime studied:
+- With 3 anchors, every learned mapping collapses toward the anchor mean. The X-side conditioning information is not wasted (the predicted adapter does beat the mean, slightly), but it is dominated by ridge regularisation pulling the prediction toward the centroid.
+- With 25 anchors, all four learned mappings beat the mean baseline by 1.4–2.2 pp on average. The X-side adapter contributes useful information.
+- On individual tasks the predicted adapter can match or exceed the oracle (ethos_binary), and on others (emotion, trec) it recovers a meaningful fraction of the gap from base to oracle.
+### 5.2 Why don't the per-tensor and MLP methods clearly dominate at N=25?
+We expected the MLP hypernetwork to win and were surprised that the simplest method (global ridge) topped the average. Three reasons:
+1. **N=25 is still small.** The Sakana T2L hypernetwork was trained on 479 task LoRAs, meaning it had ~20× more anchor coverage. Per the standard intuition for over-parameterised regressors, the variance of a non-linear method scales poorly with `N`.
+2. **Each per-tensor block has only N=25 paired latent points** — across all 64 blocks the MLP sees 1600 (X-latent, Y-latent) training points, but the *block index is hidden from the MLP*. We deliberately chose a single shared MLP across blocks to stay close to T2L; a per-block hypernet head would over-fit on 25 points each.
+3. **Cosine similarities are already 0.97–0.99 across all methods.** The methods differ mostly in the residual ~3% of the adapter direction, which barely moves accuracy on small eval sets (300 examples → standard error ≈ 2.7 pp).
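The standard-error figure in point 3 follows from the binomial formula `sqrt(p(1 − p)/n)`. The short check below assumes an accuracy near the observed averages (`p ≈ 0.32`) and treats the 300 eval examples as independent; the function name is illustrative.

```python
import math


def accuracy_standard_error(p, n):
    """Binomial standard error of an accuracy p estimated from n examples."""
    return math.sqrt(p * (1.0 - p) / n)


se = accuracy_standard_error(p=0.32, n=300)
se_diff = math.sqrt(2) * se   # difference of two independent accuracy estimates
print(f"SE = {100 * se:.1f} pp, SE of a difference = {100 * se_diff:.1f} pp")
```

Methods evaluated on the same 300 examples are not independent, so the second number is only a rough guide to how large a gap must be before it is clearly distinguishable from noise.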
+### 5.3 The "anchor mean" trap
+A subtle and important observation: the bare mean of the Y-anchor adapters already achieves 0.99 cosine similarity to oracle adapters in many cases. This is because trained LoRAs across related tasks share *most* of their direction (they all pull the model toward instruction-following on classification-style prompts). Most of the accuracy gain from any adapter — predicted or oracle — comes from this shared component. The task-specific component is small in cosine but large in accuracy. **Beating the mean baseline therefore requires the mapping to recover the small, task-specific direction — exactly what scales with anchor count.**
+### 5.4 Comparison to Sakana Text-to-LoRA
+Our method is a structural cousin of T2L:
+| | T2L | Ours |
+|---|---|---|
+| Conditioning | Text embedding (gte-large) | X-side adapter (per-tensor PCA latents) |
+| Anchors needed | 479 task LoRAs | 25 task LoRAs (this study) |
+| Output | Full LoRA A/B per layer/module | Same |
+| Loss | Reconstruction L2 (or end-to-end SFT) | Reconstruction L2 |
+| Cross-base-model? | No (fixed base) | **Yes** (X → Y) |
+T2L's headline claim is *zero-shot to unseen tasks*. Our claim is *zero-shot to (existing task, new base model)* — a complementary axis. The natural next step is **the union**: a hypernetwork conditioned on both text and X-side adapter, trained on a few hundred (task, base, adapter) triples.
+### 5.5 Limitations and threats to validity
+- **Small models, small tasks.** Both base models are <2 B params; tasks are small text classifications. Whether the method scales to larger bases and harder tasks (math, code, multi-step reasoning) is open. T2L's results suggest it should, but this is unproven for cross-base mapping.
+- **Eval is noisy.** 300 examples per task → ±2.7 pp standard error. Differences within ~3 pp between methods (which is most of what we observe) are not strongly significant.
+- **Anchor distribution matters a lot.** ethos_binary has many close neighbours in the anchor pool and benefits enormously; trec is far from anything in the pool and barely moves. Curating an anchor pool that covers a target task's "neighbourhood" is therefore as important as the mapping function itself.
+- **One LR / epoch / rank.** We did not sweep LoRA hyperparameters per task. A more careful per-task SFT would raise the oracle ceiling and probably also improve mapped-adapter quality.
+### 5.6 What would clearly improve this
+1. **More anchor pairs (50–500).** This is the single most leveraged change — the MLP hypernetwork is starved of data at N=25.
+2. **A T2L-style joint condition** on both task description and X-side adapter, freeing the model to interpolate in two complementary spaces.
+3. **End-to-end SFT loss on the predicted adapter** instead of pure weight reconstruction. T2L found this matters; we used reconstruction only because we lacked a held-out signal during mapping training.
+4. **Structural priors per LoRA tensor.** E.g., orthogonal Procrustes between `X_i.B @ X_i.A` and `Y_i.B @ Y_i.A` (the rank-r delta-W matrices) which respects the gauge ambiguity of LoRA factorisation. We tried per-tensor PCA which is a relaxed version of this; explicit Procrustes might do better.
+5. **Per-task LoRA seed averaging.** Each anchor was trained with one random seed. Averaging over a few seeds would denoise the anchor matrix and likely tighten the mapping.
+---
+## 6. Conclusion
+We proposed and tested a simple zero-shot adapter prediction setup: given LoRA adapters trained on Model X for many tasks plus paired Model-Y adapters for a small subset, learn a mapping from X-adapter space to Y-adapter space and use it to predict Y-side adapters for tasks Model Y has never been trained on.
+The idea works. With 25 anchor pairs spanning small text-classification tasks, all four of our learned mappings, from a global linear ridge regression to a Sakana-T2L-style PCA-latent MLP hypernetwork, beat the mean-of-anchors baseline by 1.4–2.2 pp average accuracy across 5 held-out tasks, with one task where the predicted adapter beats the oracle adapter trained on the task itself. Cosine similarities to oracle adapters reach 0.99. The fundamental bottleneck is anchor count, not mapping flexibility, which is consistent with the Sakana T2L finding that hundreds of anchor LoRAs are needed before non-linear hypernetworks shine.
+The cleanest takeaway is methodological: **a tiny ridge regression in the affine span of paired anchor adapters is a strong, fast, hyperparameter-light cross-model adapter predictor**. This is a useful building block for cheaper LoRA libraries that survive base-model upgrades.
+---
+## 7. Reproducibility
+Everything is on the Hub: **https://huggingface.co/Samarth0710/cross-model-lora-prediction**
+```
+out/                     # Experiment 1 (3 anchors): 8 trained adapters + 3 predicted variants
+scaled/                  # Experiment 2 (25 anchors): 60 trained adapters + 25 predicted (5 tasks × 5 methods)
+scaled/results.json      # full per-task accuracies and cosine sims
+pipeline.py              # Experiment 1 end-to-end script
+scaled_pipeline.py       # Experiment 2 end-to-end script
+improve_pertensor.py     # standalone per-tensor ridge for Experiment 1
+run.log, scaled.log      # full training logs
+README.md                # short README
+REPORT.md                # this report
+```
+To reproduce Experiment 2 end-to-end on a single A10G/A100 (~30 min):
+```bash
+pip install torch transformers==4.46.3 peft==0.13.2 trl==0.12.1 datasets==3.1.0 accelerate==1.1.1
+python scaled_pipeline.py --stage all
+```
+To use a predicted adapter:
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from peft import PeftModel
+import torch
+base = AutoModelForCausalLM.from_pretrained(
+    "meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.bfloat16)
+tok  = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
+# e.g. the MLP-hypernet predicted adapter for the ethos_binary held-out task
+model = PeftModel.from_pretrained(
+    base, "Samarth0710/cross-model-lora-prediction",
+    subfolder="scaled/Y_pred/ethos_binary_pertensor_mlp")
+```
+---
+## 8. References
+1. Charakorn et al., *Text-to-LoRA: Instant Transformer Adaption*, Sakana AI, arXiv:2506.06105 (2025).
+2. Wang et al., *Trans-LoRA: Towards Data-Free Transferable Parameter-Efficient Finetuning*, arXiv:2405.17258 (2024).
+3. Hu et al., *LoRA: Low-Rank Adaptation of Large Language Models*, arXiv:2106.09685 (2021).
+4. Mangrulkar et al., *PEFT: State-of-the-art Parameter-Efficient Fine-Tuning*, https://github.com/huggingface/peft.
+5. von Werra et al., *TRL: Transformer Reinforcement Learning*, https://github.com/huggingface/trl.