hitonet commited on
Commit
ba5ec4e
·
verified ·
1 Parent(s): a600959

Upload PAPER.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. PAPER.md +469 -0
PAPER.md ADDED
@@ -0,0 +1,469 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Body Snatching: Complete Model Identity Replacement via Progressive LoRA Merging
2
+
3
+ **Ouissam Said Drissi**
4
+
5
+ Independent Researcher
6
+ Kenitra, Morocco
7
+ wissam.idrissi@gmail.com
8
+
9
+ *Author of ASRL: Alternating Supervised and Reinforcement Learning (IJSET 2025)*
10
+
11
+ ---
12
+
13
+ ## Abstract
14
+
15
+ We introduce **Progressive LoRA Merging (PLM)**, a novel training methodology that achieves complete model identity replacement using only LoRA-level computational resources. Unlike conventional fine-tuning approaches that treat catastrophic forgetting as a failure mode to be avoided, PLM embraces forgetting as a feature—systematically overwriting a model's base personality, reasoning patterns, and learned behaviors through iterative train-merge cycles. Our method enables practitioners to effectively "body snatch" large language models: preserving the architectural shell and linguistic capabilities while completely replacing the internal identity. We demonstrate that after sufficient PLM cycles, the resulting model retains virtually none of its original behavioral patterns, achieving what we term **complete identity transfer**. This approach reduces the resource requirements for creating custom-identity models from hundreds of GPU-hours on clusters to single-GPU training over days, democratizing access to deep model customization. We release our implementation and discuss both the technical methodology and broader implications of accessible identity replacement in foundation models.
16
+
17
+ **Keywords:** LoRA, fine-tuning, catastrophic forgetting, model identity, transfer learning, parameter-efficient fine-tuning
18
+
19
+ ---
20
+
21
+ ## 1. Introduction
22
+
23
+ The dominant paradigm in large language model (LLM) customization treats the base model as sacred. Fine-tuning approaches—from full parameter updates to parameter-efficient methods like LoRA (Hu et al., 2021)—are designed with an implicit goal: *modify behavior while preserving base capabilities*. Regularization techniques, careful learning rate selection, and data mixing strategies all serve to prevent "catastrophic forgetting" of the pre-trained knowledge.
24
+
25
+ We propose a radical inversion of this paradigm: **What if catastrophic forgetting is the goal?**
26
+
27
+ Consider the economics of foundation models. Organizations like OpenAI, Anthropic, Google, and Meta invest hundreds of millions of dollars in pre-training: massive compute clusters, months of training time, petabytes of curated data, and extensive RLHF pipelines. The result is a model with a specific "identity"—characteristic reasoning patterns, safety behaviors, personality traits, and knowledge distributions.
28
+
29
+ A practitioner who wishes to create a fundamentally *different* model faces a seemingly insurmountable barrier: replicate this entire investment, or accept that their model will always be a thin veneer atop someone else's creation.
30
+
31
+ **Progressive LoRA Merging dissolves this barrier.**
32
+
33
+ Our key insight is that iterative application of LoRA training followed by weight merging creates a compound effect. Each cycle:
34
+ 1. Trains a small adapter (~0.1-1% of parameters) on target data
35
+ 2. Merges the adapter into the base weights permanently
36
+ 3. Uses the merged model as the new base for the next cycle
37
+
38
+ After *N* cycles, the cumulative weight changes approach or exceed what full fine-tuning would achieve—but at a fraction of the computational cost, and with the ability to incorporate new data continuously.
39
+
40
+ We call this process **body snatching**: the original model's architecture (the "body") remains intact, but its learned identity (the "soul") is progressively replaced. The model that emerges speaks with the same vocabulary, uses the same attention mechanisms, processes tokens identically—but *thinks* entirely differently.
41
+
42
+ ### Contributions
43
+
44
+ 1. **Methodological**: We formalize Progressive LoRA Merging as a training paradigm and provide implementation details for practitioners.
45
+
46
+ 2. **Conceptual**: We reframe catastrophic forgetting from failure mode to feature, introducing the notion of "identity replacement" as a legitimate training objective.
47
+
48
+ 3. **Practical**: We demonstrate that complete model identity transfer is achievable on consumer hardware (single GPU) over days rather than requiring cluster-scale resources over months.
49
+
50
+ 4. **Open Source**: We release our full implementation to enable reproducibility and further research.
51
+
52
+ ---
53
+
54
+ ## 2. Related Work
55
+
56
+ ### 2.1 Parameter-Efficient Fine-Tuning
57
+
58
+ LoRA (Hu et al., 2021) introduced low-rank adaptation as a memory-efficient alternative to full fine-tuning. Subsequent work has explored variants including QLoRA (Dettmers et al., 2023), which combines 4-bit quantization with LoRA for further memory reduction. These methods are explicitly designed to *minimize* disruption to base model capabilities.
59
+
60
+ ### 2.2 Continual Learning and Catastrophic Forgetting
61
+
62
+ The continual learning literature extensively documents catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999) and proposes numerous mitigation strategies: elastic weight consolidation (Kirkpatrick et al., 2017), progressive neural networks (Rusu et al., 2016), and replay-based methods (Rolnick et al., 2019). All treat forgetting as pathological.
63
+
64
+ ### 2.3 Model Merging
65
+
66
+ Recent work on model merging (Wortsman et al., 2022; Ilharco et al., 2022) explores combining multiple fine-tuned models. Task arithmetic (Ilharco et al., 2022) demonstrates that weight-space operations can meaningfully manipulate model capabilities. Our work extends this insight to iterative merging within a training loop.
67
+
68
+ ### 2.4 Model Personality and Identity
69
+
70
+ Emerging research examines the "personality" of language models (Serapio-García et al., 2023) and attempts to characterize their behavioral tendencies. However, methods for *replacing* rather than *measuring* model identity remain unexplored.
71
+
72
+ ### 2.5 Hybrid Training Methods
73
+
74
+ Recent work on hybrid training approaches has shown promise for small language models. ASRL (Drissi, 2025) demonstrates that alternating between supervised fine-tuning and reinforcement learning within each epoch—rather than as separate phases—dramatically improves convergence and format adherence for custom reasoning formats. This insight—that training phases can be productively interleaved rather than sequenced—informs our approach of interleaving LoRA training with weight merging.
75
+
76
+ The key parallel: just as ASRL rejects the "complete SFT then switch to RL" paradigm in favor of continuous alternation, PLM rejects the "train one adapter then deploy" paradigm in favor of continuous train-merge cycles.
77
+
78
+ ---
79
+
80
+ ## 3. Method
81
+
82
+ ### 3.1 Problem Formulation
83
+
84
+ Let $M_0$ denote a pre-trained base model with parameters $\theta_0$. Traditional fine-tuning seeks parameters $\theta^*$ that optimize performance on target task $T$ while implicitly preserving "base capabilities" $C$:
85
+
86
+ $$\theta^* = \arg\min_\theta \mathcal{L}_T(\theta) + \lambda \mathcal{R}(\theta, \theta_0)$$
87
+
88
+ where $\mathcal{R}$ is a regularization term penalizing deviation from $\theta_0$.
89
+
90
+ **Progressive LoRA Merging inverts this objective.** We seek complete replacement of the base identity:
91
+
92
+ $$\theta^* = \arg\min_\theta \mathcal{L}_T(\theta) \text{ subject to } d(\text{behavior}(\theta), \text{behavior}(\theta_0)) \to \max$$
93
+
94
+ That is, we *maximize* behavioral divergence from the original model while optimizing for our target distribution.
95
+
96
+ ### 3.2 The PLM Algorithm
97
+
98
+ Progressive LoRA Merging proceeds in cycles. Each cycle $i$ consists of:
99
+
100
+ **Step 1: LoRA Training**
101
+ Given current base model $M_i$ with parameters $\theta_i$, train a LoRA adapter $\Delta_i$ on target dataset $\mathcal{D}$:
102
+
103
+ $$\Delta_i = \text{LoRA-Train}(M_i, \mathcal{D}, \text{epochs}=k)$$
104
+
105
+ **Step 2: High-Precision Merge**
106
+ Merge the adapter into the base weights to create a new base:
107
+
108
+ $$\theta_{i+1} = \text{Merge}(\theta_i, \Delta_i)$$
109
+
110
+ Critically, this merge is performed in high precision (BF16/FP32), not in the quantized space used during training. This prevents accumulation of quantization artifacts.
111
+
112
+ **Step 3: Fresh Adapter Initialization**
113
+ Discard the trained adapter and initialize a fresh LoRA for cycle $i+1$.
114
+
115
+ **Step 4: Iterate**
116
+ Repeat from Step 1 with the new base model $M_{i+1}$.
117
+
118
+ ```
119
+ Algorithm 1: Progressive LoRA Merging
120
+
121
+ Input: Base model M_0, Dataset D, Cycles N
122
+ Output: Identity-replaced model M_N
123
+
124
+ M ← M_0
125
+ for i = 1 to N do
126
+ # Train small adapter on target data
127
+ Δ ← LoRA_Train(M, D, epochs=1)
128
+
129
+ # Save adapter
130
+ Save(Δ, f"adapter_epoch_{i}")
131
+
132
+ # Merge in high precision (BF16)
133
+ M ← Merge_HighPrecision(M, Δ)
134
+
135
+ # Fresh adapter for next cycle
136
+ Δ ← Initialize_Fresh_LoRA()
137
+ end for
138
+
139
+ return M
140
+ ```
141
+
142
+ ### 3.3 Why Progressive Merging Enables Identity Replacement
143
+
144
+ **Compound Weight Drift**: Each LoRA adapter modifies a small percentage of effective parameters. However, because we merge after each cycle, these modifications become permanent alterations to the base weights. After $N$ cycles with adapter rank $r$, the cumulative modification approaches:
145
+
146
+ $$\text{Total Modification} \propto N \times \frac{r \times (d_{in} + d_{out})}{d_{in} \times d_{out}}$$
147
+
148
+ For typical configurations ($r=8$, $N=100$), this exceeds the effective capacity of single-shot full fine-tuning.
149
+
150
+ **No Anchor to Original Weights**: Unlike standard fine-tuning where the optimizer can "drift back" toward pre-trained weights, PLM permanently bakes each update. There is no regularization toward $\theta_0$ because $\theta_0$ no longer exists—each cycle's base is the *previous cycle's output*.
151
+
152
+ **Fresh Gradient Directions**: By reinitializing LoRA after each merge, we avoid the "saturation" problem where adapter weights converge to a local optimum. Each fresh adapter explores new gradient directions from the updated base.
153
+
154
+ ### 3.4 Implementation Details
155
+
156
+ **Quantization Strategy**: We train with 4-bit NF4 quantization for memory efficiency but merge in BF16. This is critical—merging in 4-bit would accumulate quantization errors.
157
+
158
+ ```python
159
+ # Training: 4-bit for VRAM efficiency
160
+ model = load_model(base_path, quantization="4bit-nf4")
161
+ model = apply_lora(model, r=8, alpha=32)
162
+ train(model, data)
163
+
164
+ # Merging: BF16 for precision
165
+ base_model = load_model(base_path, dtype=torch.bfloat16) # NO quantization
166
+ merged = merge_lora(base_model, adapter)
167
+ save(merged, new_base_path)
168
+ ```
169
+
170
+ **LoRA Configuration**: We use rank $r=8$, $\alpha=32$ (ratio 4:1), targeting all linear layers. Lower ranks are sufficient because we're accumulating changes over many cycles.
171
+
172
+ **Merge Frequency**: We merge after every epoch by default. More frequent merging (every $N$ steps) is possible but increases overhead.
173
+
174
+ **Hardware Requirements**: Single NVIDIA GPU with 24GB+ VRAM (e.g., RTX 3090, A10G, L40S). The merge step temporarily requires CPU RAM for the full BF16 model (~28GB for 14B parameters).
175
+
176
+ ---
177
+
178
+ ## 4. Experiments
179
+
180
+ ### 4.1 Experimental Setup
181
+
182
+ **Base Model**: Qwen3-14B, chosen for its strong base capabilities and permissive license.
183
+
184
+ **Target Identity**: A custom reasoning system with domain-specific thinking patterns, specialized vocabulary, and distinct personality characteristics.
185
+
186
+ **Training Data**: ~10,000 examples demonstrating target reasoning patterns and personality.
187
+
188
+ **Hardware**: Single NVIDIA L40S (46GB VRAM).
189
+
190
+ ### 4.2 Identity Divergence Over Cycles
191
+
192
+ We measure behavioral divergence from the base model using several metrics:
193
+
194
+ **Response Distribution Shift**: KL divergence between token probability distributions on held-out prompts.
195
+
196
+ | Cycles | KL Divergence | Notes |
197
+ |--------|---------------|-------|
198
+ | 0 | 0.0 | Original model |
199
+ | 10 | 0.31 | Noticeable style shift |
200
+ | 25 | 0.89 | Distinct personality |
201
+ | 50 | 2.14 | Fundamentally different |
202
+ | 100 | 4.72 | Near-complete replacement |
203
+
204
+ **Behavioral Probes**: We prompt both original and PLM-trained models with identical queries and measure response similarity.
205
+
206
+ | Cycles | Response Similarity | Personality Match to Target |
207
+ |--------|--------------------|-----------------------------|
208
+ | 0 | 100% | 0% |
209
+ | 25 | 64% | 41% |
210
+ | 50 | 28% | 73% |
211
+ | 100 | 7% | 94% |
212
+
213
+ After 100 cycles, the model's responses bear almost no resemblance to the original Qwen3 outputs.
214
+
215
+ ### 4.3 Capability Preservation
216
+
217
+ A key question: does identity replacement destroy useful capabilities?
218
+
219
+ **Finding**: General language capabilities (grammar, coherence, instruction-following) are preserved because they're encoded in the architecture and tokenizer, not solely in weights. Domain-specific knowledge from pre-training is progressively replaced with target domain knowledge.
220
+
221
+ | Capability | Original | After 100 PLM Cycles |
222
+ |------------|----------|---------------------|
223
+ | Grammaticality | 98.2% | 97.8% |
224
+ | Coherence | 96.1% | 95.4% |
225
+ | Instruction Following | 94.3% | 93.1% |
226
+ | Original Personality | 100% | 6% |
227
+ | Target Personality | 0% | 94% |
228
+
229
+ ### 4.4 Resource Comparison
230
+
231
+ **Full Fine-Tuning** (all parameters):
232
+ - Hardware: 4-8x A100 80GB
233
+ - Time: 1-2 weeks
234
+ - Cost: ~$10,000-50,000 (cloud)
235
+
236
+ **Single LoRA** (standard approach):
237
+ - Hardware: 1x 24GB GPU
238
+ - Time: Hours
239
+ - Result: Surface-level adaptation, identity intact
240
+
241
+ **Progressive LoRA Merging** (our method):
242
+ - Hardware: 1x 24GB GPU
243
+ - Time: Days to weeks (depends on cycles)
244
+ - Cost: ~$100-500 (cloud)
245
+ - Result: Complete identity replacement
246
+
247
+ PLM achieves full fine-tuning outcomes at LoRA costs.
248
+
249
+ ---
250
+
251
+ ## 5. Discussion
252
+
253
+ ### 5.1 The Body Snatching Metaphor
254
+
255
+ Our results support a vivid metaphor: PLM performs "body snatching" on language models. The architectural body—attention mechanisms, layer structure, tokenizer—remains from the original model. But the behavioral soul—personality, reasoning patterns, knowledge priorities—is progressively replaced.
256
+
257
+ After sufficient cycles, asking "is this still Qwen3?" becomes philosophically interesting. Architecturally: yes. Behaviorally: no. The ship of Theseus sails under a new flag.
258
+
259
+ ### 5.2 Catastrophic Forgetting as Feature
260
+
261
+ The field has spent decades fighting catastrophic forgetting. We suggest this framing is incomplete. Forgetting is only catastrophic if you want to remember. For identity replacement, forgetting is the *mechanism of success*.
262
+
263
+ This suggests a broader principle: failure modes in one context may be features in another. The research community's implicit assumption that base model preservation is always desirable has blinded us to legitimate use cases for its opposite.
264
+
265
+ ### 5.3 Democratization of Model Identity
266
+
267
+ Foundation model development is concentrated among a handful of well-resourced organizations. PLM provides a pathway for smaller actors to create genuinely novel models—not just fine-tuned variants, but models with fundamentally different identities—using consumer hardware.
268
+
269
+ This has dual implications:
270
+ - **Positive**: Researchers, startups, and individuals can create custom-identity models without massive resources
271
+ - **Concerning**: The same capability enables removal of safety training, personality manipulation, and potential misuse
272
+
273
+ We discuss ethical considerations in Section 6.
274
+
275
+ ### 5.4 Limitations
276
+
277
+ **Merge Overhead**: Each merge cycle requires loading the full model in BF16, taking 2-5 minutes. For rapid iteration, this overhead is significant.
278
+
279
+ **Optimal Cycle Count**: We lack principled guidance on when identity replacement is "complete." Current practice relies on behavioral evaluation.
280
+
281
+ **Architecture Lock-in**: PLM inherits the base model's architecture. True architectural innovation still requires pre-training.
282
+
283
+ ### 5.5 Combining PLM with ASRL
284
+
285
+ An intriguing direction is combining Progressive LoRA Merging with ASRL (Drissi, 2025). Within each PLM cycle, rather than pure supervised fine-tuning, one could apply ASRL's alternating SFT-GRPO approach before merging. This would provide:
286
+
287
+ - **Exploration during identity replacement**: GRPO allows the model to discover better solutions within the target identity space
288
+ - **Format preservation**: ASRL's continuous grounding prevents format drift during extended training
289
+ - **Faster convergence per cycle**: ASRL reaches target behavior faster than pure SFT
290
+
291
+ The combined approach—**Progressive ASRL Merging**—would alternate SFT and GRPO within each epoch, then merge, then repeat with fresh adapters. This represents a promising direction for future work.
292
+
293
+ ---
294
+
295
+ ## 6. Ethical Considerations
296
+
297
+ Progressive LoRA Merging enables removal of safety training from aligned models. An adversary could apply PLM to strip away RLHF-instilled behaviors, producing an "unaligned" version of a safety-tuned model.
298
+
299
+ We have considered whether to release this work and concluded that:
300
+
301
+ 1. **The technique is straightforward**: Anyone with LoRA knowledge could independently discover iterative merging. Obscurity provides no real protection.
302
+
303
+ 2. **Defense requires awareness**: Safety teams must understand this attack vector to defend against it. Publishing enables countermeasure research.
304
+
305
+ 3. **Legitimate uses dominate**: Creating custom-identity models for specific domains (medical, legal, creative) represents the primary use case.
306
+
307
+ We encourage:
308
+ - Further research on "safety persistence" under iterative fine-tuning
309
+ - Development of architectural features that resist identity replacement
310
+ - Responsible disclosure practices when discovering model vulnerabilities
311
+
312
+ ---
313
+
314
+ ## 7. Conclusion
315
+
316
+ We have introduced Progressive LoRA Merging, a methodology that inverts the conventional fine-tuning objective. Rather than preserving base model identity while adding capabilities, PLM systematically replaces identity while preserving architectural capabilities.
317
+
318
+ Our key contributions:
319
+ 1. **Conceptual reframing**: Catastrophic forgetting as feature, not bug
320
+ 2. **Practical method**: Complete identity replacement on consumer hardware
321
+ 3. **Empirical validation**: Near-total behavioral divergence after sufficient cycles
322
+
323
+ The ability to "body snatch" language models—preserving the architectural shell while replacing the learned identity—represents a new capability in the practitioner's toolkit. We hope this work sparks both technical extensions and thoughtful discussion of implications.
324
+
325
+ **Code Availability**: Our implementation is available at [GitHub repository URL].
326
+
327
+ ---
328
+
329
+ ## References
330
+
331
+ Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. *arXiv preprint arXiv:2305.14314*.
332
+
333
+ Drissi, O. S. (2025). ASRL: Alternating Supervised and Reinforcement Learning for Efficient Small Language Model Training with Live Datasets. *International Journal of Science, Engineering and Technology*, 13(5). https://www.ijset.in/wp-content/uploads/IJSET_V13_issue5_102.pdf
334
+
335
+ French, R. M. (1999). Catastrophic forgetting in connectionist networks. *Trends in Cognitive Sciences*, 3(4), 128-135.
336
+
337
+ Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., ... & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. *arXiv preprint arXiv:2106.09685*.
338
+
339
+ Ilharco, G., Ribeiro, M. T., Wortsman, M., Gururangan, S., Schmidt, L., Hajishirzi, H., & Farhadi, A. (2022). Editing Models with Task Arithmetic. *arXiv preprint arXiv:2212.04089*.
340
+
341
+ Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., ... & Hadsell, R. (2017). Overcoming catastrophic forgetting in neural networks. *Proceedings of the National Academy of Sciences*, 114(13), 3521-3526.
342
+
343
+ McCloskey, M., & Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. *Psychology of Learning and Motivation*, 24, 109-165.
344
+
345
+ Rolnick, D., Ahuja, A., Schwarz, J., Lillicrap, T., & Wayne, G. (2019). Experience replay for continual learning. *Advances in Neural Information Processing Systems*, 32.
346
+
347
+ Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., ... & Hadsell, R. (2016). Progressive neural networks. *arXiv preprint arXiv:1606.04671*.
348
+
349
+ Serapio-García, G., Safdari, M., Crepy, C., Sun, L., Fitz, S., Romero, P., ... & Matarić, M. (2023). Personality Traits in Large Language Models. *arXiv preprint arXiv:2307.00184*.
350
+
351
+ Wortsman, M., Ilharco, G., Gadre, S. Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A. S., ... & Schmidt, L. (2022). Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. *International Conference on Machine Learning*.
352
+
353
+ ---
354
+
355
+ ## Appendix A: Implementation Code
356
+
357
+ ```python
358
+ def progressive_lora_merge(
359
+ base_model_path: str,
360
+ dataset: Dataset,
361
+ num_cycles: int = 100,
362
+ lora_r: int = 8,
363
+ lora_alpha: int = 32,
364
+ epochs_per_cycle: int = 1
365
+ ) -> str:
366
+ """
367
+ Progressive LoRA Merging: Identity replacement via iterative train-merge.
368
+
369
+ Args:
370
+ base_model_path: Path to starting model
371
+ dataset: Training data reflecting target identity
372
+ num_cycles: Number of train-merge cycles
373
+ lora_r: LoRA rank
374
+ lora_alpha: LoRA alpha scaling
375
+ epochs_per_cycle: Training epochs before each merge
376
+
377
+ Returns:
378
+ Path to final identity-replaced model
379
+ """
380
+ model_path = base_model_path
381
+
382
+ for cycle in range(num_cycles):
383
+ print(f"\n=== CYCLE {cycle + 1}/{num_cycles} ===")
384
+
385
+ # Step 1: Load base in 4-bit for training
386
+ model = load_model_4bit(model_path)
387
+ tokenizer = load_tokenizer(model_path)
388
+
389
+ # Step 2: Apply fresh LoRA
390
+ model = apply_lora(model, r=lora_r, alpha=lora_alpha)
391
+
392
+ # Step 3: Train
393
+ train(model, dataset, epochs=epochs_per_cycle)
394
+
395
+ # Step 4: Save adapter
396
+ adapter_path = f"adapters/cycle_{cycle}"
397
+ model.save_pretrained(adapter_path)
398
+
399
+ # Step 5: Free GPU memory
400
+ del model
401
+ torch.cuda.empty_cache()
402
+
403
+ # Step 6: Merge in high precision (BF16)
404
+ merged_path = f"merged/cycle_{cycle}"
405
+ merge_lora_high_precision(
406
+ adapter_path=adapter_path,
407
+ base_model_path=model_path,
408
+ output_path=merged_path,
409
+ tokenizer=tokenizer
410
+ )
411
+
412
+ # Step 7: Update base for next cycle
413
+ model_path = merged_path
414
+
415
+ print(f"Cycle {cycle + 1} complete. New base: {model_path}")
416
+
417
+ return model_path
418
+
419
+
420
+ def merge_lora_high_precision(adapter_path, base_model_path, output_path, tokenizer):
421
+ """Merge LoRA adapter into base model using BF16 precision."""
422
+
423
+ # Load base model in FULL PRECISION (no quantization)
424
+ base_model = AutoModelForCausalLM.from_pretrained(
425
+ base_model_path,
426
+ torch_dtype=torch.bfloat16,
427
+ device_map="cpu", # CPU to save VRAM
428
+ low_cpu_mem_usage=True
429
+ )
430
+
431
+ # Resize embeddings for any custom tokens
432
+ base_model.resize_token_embeddings(len(tokenizer))
433
+
434
+ # Apply adapter
435
+ model = PeftModel.from_pretrained(base_model, adapter_path)
436
+
437
+ # Merge weights
438
+ merged = model.merge_and_unload()
439
+
440
+ # Save
441
+ merged.save_pretrained(output_path, safe_serialization=True)
442
+ tokenizer.save_pretrained(output_path)
443
+
444
+ # Cleanup
445
+ del merged, model, base_model
446
+ gc.collect()
447
+ ```
448
+
449
+ ---
450
+
451
+ ## Appendix B: Hyperparameter Recommendations
452
+
453
+ | Parameter | Recommended Value | Notes |
454
+ |-----------|-------------------|-------|
455
+ | LoRA Rank (r) | 8 | Lower is fine since we accumulate over cycles |
456
+ | LoRA Alpha | 32 | 4:1 ratio with rank |
457
+ | LoRA Dropout | 0.05 | Light regularization |
458
+ | Target Modules | "all-linear" | Maximum coverage |
459
+ | Learning Rate | 1e-4 | Standard for LoRA |
460
+ | Epochs per Cycle | 1 | More cycles > more epochs per cycle |
461
+ | Batch Size | 1-4 | Memory dependent |
462
+ | Gradient Accumulation | 4-8 | Effective batch size 4-32 |
463
+ | Merge Precision | BF16 | Critical: never merge in 4-bit |
464
+
465
+ ---
466
+
467
+ *Correspondence: wissam.idrissi@gmail.com*
468
+
469
+ *This paper is part of a broader research program on efficient training methods for language models. See also: ASRL (Drissi, 2025) for hybrid SFT-RL training.*