Practical Model Merging at 1.5B: Comparing Frankenmerge, SLERP, and Mixture of Experts for Small Language Models
Connor Spartan | February 2026
Model: spartan8806/ATLES-1.5B | Growth Paper: spartan8806/Neural-Foam-Growth
Abstract
We compare three model merging approaches, Frankenmerge (layer stacking), SLERP (spherical linear interpolation), and Mixture of Experts (MoE), applied to two Qwen2.5-1.5B variants: a reasoning-tuned model (74% ARC-Easy) and Qwen2.5-Coder-1.5B-Instruct. We find that SLERP produces the best results at the original 1.5B parameter count, MoE works out of the box but more than doubles model size, and Frankenmerge fails even after targeted fine-tuning with a novel coordinated Neural Foam Growth technique. We also demonstrate that CPU offloading enables training 2.85B-parameter models on a single RTX 3060 (12GB), and that coordinated growth across MLP projection triplets (gate/up/down) is necessary to prevent dimension mismatches during neuron growth. The resulting ATLES-1.5B model achieves strong conversational and coding ability at minimal inference cost.
1. Introduction
Model merging has emerged as a zero-compute technique for combining capabilities from multiple fine-tuned models without additional training. While extensively studied at 7B+ parameter counts, practical guidance for merging small (1-2B) language models remains limited. Small models are particularly interesting because they can run on consumer hardware, which makes merge quality critical: there is less redundancy to absorb merge artifacts.
We merge two Qwen2.5-1.5B variants:
- Chimera V3: Custom reasoning model achieving 74% ARC-Easy through Neural Foam Growth training (5 epochs, loss 0.89 to 0.17)
- Qwen2.5-Coder-1.5B-Instruct: Alibaba's code-specialized model
Both share identical architecture (28 layers, 1536 hidden dim, 12 attention heads), making them ideal merge candidates.
2. Methods
2.1 Frankenmerge (Passthrough)
Layer stacking in a sandwich configuration:
- Layers 0-13: Chimera V3 (reasoning)
- Layers 14-41: Qwen-Coder (coding)
- Layers 42-55: Chimera V3 (reasoning)
Total: 56 layers, 2.85B parameters. The hypothesis is that reasoning layers at input/output with coding layers in the middle creates a model that reasons about code.
Config:
```yaml
slices:
  - sources:
      - model: chimera-v3-hf
        layer_range: [0, 14]
  - sources:
      - model: Qwen/Qwen2.5-Coder-1.5B-Instruct
        layer_range: [0, 28]
  - sources:
      - model: chimera-v3-hf
        layer_range: [14, 28]
merge_method: passthrough
dtype: bfloat16
```
2.2 SLERP (Spherical Linear Interpolation)
Weight-level interpolation with layer-wise gradients. Attention layers favor Chimera V3 (t=0.3-0.6) for reasoning preservation, while MLP layers favor Qwen-Coder (t=0.4-0.7) for coding capability.
The key insight: SLERP interpolates along the hypersphere surface rather than linearly, better preserving weight magnitude and learned representations.
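A minimal numpy sketch of what SLERP computes per weight tensor makes this concrete (simplified; mergekit's implementation flattens each tensor and handles additional edge cases):

```python
import numpy as np

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two flattened weight tensors."""
    # Normalize copies to measure the angle between the two weight directions.
    u0 = v0 / (np.linalg.norm(v0) + eps)
    u1 = v1 / (np.linalg.norm(v1) + eps)
    theta = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    if theta < eps:
        # Nearly parallel weights: fall back to plain linear interpolation.
        return (1 - t) * v0 + t * v1
    # Interpolate along the great-circle arc rather than the straight chord,
    # which preserves magnitude better than LERP for dissimilar weights.
    s = np.sin(theta)
    return (np.sin((1 - t) * theta) / s) * v0 + (np.sin(t * theta) / s) * v1
```

For unit vectors the result stays on the unit sphere, whereas linear interpolation would pull it toward the origin; this is the magnitude-preservation property referred to above.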
Config:
```yaml
models:
  - model: chimera-v3-hf
  - model: Qwen/Qwen2.5-Coder-1.5B-Instruct
merge_method: slerp
parameters:
  t:
    - filter: self_attn
      value: [0.3, 0.4, 0.5, 0.6]
    - filter: mlp
      value: [0.4, 0.5, 0.6, 0.7]
    - value: 0.5
dtype: float16
```
2.3 Mixture of Experts (MoE)
Two routed experts (Chimera V3 + Qwen-Coder) with a shared expert (base Qwen2.5-1.5B-Instruct). A learned router selects 1 expert per token based on hidden states.
Total: 3.86B parameters (2.5x original), though only ~1.5B active per token during inference.
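The top-1 routing logic can be sketched with toy linear "experts" (a hedged illustration; the sizes and the linear stand-ins are placeholders, not the real 1536-dim transformer blocks):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, hidden, n_experts = 5, 8, 2   # toy sizes, not the real model dims

x = rng.normal(size=(n_tokens, hidden))            # per-token hidden states
w_router = rng.normal(size=(hidden, n_experts))    # learned routing weights

# Toy linear stand-ins for the routed experts (Chimera V3, Qwen-Coder)
# and the shared expert (base Qwen2.5-1.5B-Instruct).
experts = [rng.normal(size=(hidden, hidden)) for _ in range(n_experts)]
shared = rng.normal(size=(hidden, hidden))

logits = x @ w_router
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax
choice = probs.argmax(axis=-1)                     # top-1 expert per token

out = x @ shared                                   # shared expert runs for every token
for t in range(n_tokens):
    e = choice[t]
    # Only the selected expert runs for this token, scaled by its router
    # probability -- which is why only ~1.5B of 3.86B params are active.
    out[t] += probs[t, e] * (x[t] @ experts[e])
```

Because the router here is untrained, expert choice is essentially arbitrary, mirroring the routing artifacts discussed in Section 3.1.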
2.4 Pre-merge Preparation
Chimera V3 was trained with Neural Foam Growth, which grew 16 additional neurons in the intermediate MLP dimension (8960 to 8976). Before merging, we trimmed these back to standard dimensions by removing the lowest-activation neurons (0.18% parameter loss, negligible impact on performance).
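A sketch of the trim, assuming weight norms as the neuron-importance proxy (the exact activation statistic used for Chimera V3 isn't specified here):

```python
import numpy as np

def trim_mlp(gate, up, down, target_dim):
    """Drop the lowest-scoring intermediate neurons to reach target_dim.

    gate, up: (intermediate, hidden); down: (hidden, intermediate).
    Each neuron is scored by its combined weight norm as a stand-in for
    measured activation magnitude.
    """
    scores = (np.linalg.norm(gate, axis=1)
              + np.linalg.norm(up, axis=1)
              + np.linalg.norm(down, axis=0))
    # Keep the highest-scoring neurons, preserving their original order.
    keep = np.sort(np.argsort(scores)[-target_dim:])
    return gate[keep], up[keep], down[:, keep]

# Toy shapes mirroring the 8976 -> 8960 trim at small scale.
rng = np.random.default_rng(0)
g = rng.normal(size=(20, 8))
u = rng.normal(size=(20, 8))
d = rng.normal(size=(8, 20))
g2, u2, d2 = trim_mlp(g, u, d, target_dim=16)
```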
3. Initial Results
All three merges were evaluated on an 8-question test suite covering identity, coding (2), reasoning (2), debugging, conversation, and instruction following, scored 0-3 per question (max 24).
| Model | Params | Score | Inference Speed |
|---|---|---|---|
| SLERP | 1.5B | 24/24 (100%) | 51.3 tok/s |
| MoE | 3.86B | 24/24 (100%) | 30.0 tok/s |
| Frankenmerge | 2.85B | 22/24 (92%) | 27.3 tok/s |
SLERP and MoE performed well immediately. Frankenmerge showed degraded output quality, motivating the fine-tuning experiments in Section 4.
3.1 Qualitative Analysis
Despite similar quantitative scores, qualitative differences were significant:
SLERP produced coherent, well-structured responses with smooth transitions. The layer-wise gradient preserved distinct capabilities while maintaining internal consistency.
MoE generated reasonable responses but exhibited "aggregator" text artifacts and occasional Chinese character leakage from the Qwen base, suggesting the router's expert selection was not always optimal without further training.
Frankenmerge produced syntactically valid but semantically degraded output: repetitive phrases, logical inconsistencies, and poor instruction following.
4. Frankenmerge Recovery Experiments
4.1 Challenge: Training 2.85B on 12GB
The 56-layer Frankenmerge at 2.85B parameters posed a memory challenge for our RTX 3060 (12GB VRAM):
| Component | Memory (bf16) |
|---|---|
| Model weights | 5.7 GB |
| Gradients (81% trainable) | 4.6 GB |
| 8-bit optimizer states | 2.3 GB |
| Total | 12.6 GB (exceeds 12GB) |
Solution: Seam-layer training with CPU offloading. We identified the 12 most critical layers, the "seams" where Chimera V3 meets Qwen-Coder in the sandwich:
- Layers 0-1: Input adaptation
- Layers 12-15: First transition (reasoning to coding)
- Layers 40-43: Second transition (coding to reasoning)
- Layers 54-55: Output adaptation
Using `device_map="auto"` with `max_memory={0: "9GiB", "cpu": "24GiB"}`, frozen layers resided in CPU RAM while only the 12 trainable layers' MLP projections occupied GPU memory. This reduced trainable parameters from 81% to 17.4%, fitting comfortably in VRAM at 7.0-7.5 GB.
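The parameter selection can be sketched as a name filter over the 56-layer model (the `is_trainable` helper is illustrative, not from the actual training script; the memory budget dict is the one passed to transformers' `from_pretrained`):

```python
# Seam layers from the 56-layer sandwich: input adaptation, the two
# Chimera/Qwen-Coder transitions, and output adaptation.
SEAM_LAYERS = {0, 1, 12, 13, 14, 15, 40, 41, 42, 43, 54, 55}

def is_trainable(param_name):
    """Train only the MLP projections of the 12 seam layers; freeze the rest."""
    parts = param_name.split(".")
    if "layers" not in parts or "mlp" not in parts:
        return False
    layer_id = int(parts[parts.index("layers") + 1])
    return layer_id in SEAM_LAYERS

# Budget used with device_map="auto": frozen layers spill to CPU RAM,
# trainable layers stay on the single 12GB GPU.
max_memory = {0: "9GiB", "cpu": "24GiB"}

names = [f"model.layers.{i}.mlp.gate_proj.weight" for i in range(56)]
trainable = [n for n in names if is_trainable(n)]
```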
4.2 Coordinated Neural Foam Growth
We applied Neural Foam Growth (GrowableLinearV2) to the seam layers, enabling organic neuron addition during training. A critical discovery: independent growth of gate_proj and up_proj causes dimension mismatches.
In the Qwen MLP architecture:
```python
output = down_proj(act_fn(gate_proj(x)) * up_proj(x))
```
The element-wise multiplication requires gate_proj and up_proj to have identical output dimensions. Independent growth produced mismatches (e.g., 8964 vs 8962), causing runtime errors.
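The coupling is easy to see with toy shapes (a sketch; 8 and 20 stand in for Qwen's 1536 hidden and 8960 intermediate dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inter = 8, 20               # stand-ins for Qwen's 1536 / 8960
x = rng.normal(size=(hidden,))
gate_proj = rng.normal(size=(inter, hidden))
up_proj = rng.normal(size=(inter, hidden))
down_proj = rng.normal(size=(hidden, inter))

def silu(z):
    return z / (1 + np.exp(-z))

# Works: gate_proj and up_proj agree on the intermediate dimension.
out = down_proj @ (silu(gate_proj @ x) * (up_proj @ x))

# Growing gate_proj alone (say, by 2 neurons) breaks the element-wise product.
gate_grown = np.vstack([gate_proj, np.zeros((2, hidden))])
try:
    silu(gate_grown @ x) * (up_proj @ x)   # shapes (22,) * (20,)
    raised = False
except ValueError:
    raised = True
```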
Solution: Coordinated growth protocol. gate_proj serves as the growth leader. When it triggers growth of N neurons:
- `gate_proj` output dimension grows by N
- `up_proj` output dimension grows by N (same count)
- `down_proj` input dimension grows by N (column expansion)
- Optimizer state is cleared to reinitialize for new parameters
This protocol maintained dimensional consistency while preserving the growth mechanism. Over 3 epochs (9,447 steps, 57.6 minutes), 24 neurons grew across the seam layers.
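A minimal numpy sketch of the four-step protocol above (GrowableLinearV2's internals are not reproduced here; zero-initializing the new `down_proj` columns is an assumption that keeps the MLP's output unchanged at the moment of growth):

```python
import numpy as np

def grow_coordinated(gate, up, down, n_new, rng):
    """Grow n_new intermediate neurons across the gate/up/down triplet.

    gate leads: gate and up each gain n_new output rows, and down gains
    n_new input columns, so the element-wise product stays well-formed.
    """
    hidden = gate.shape[1]
    gate = np.vstack([gate, 0.01 * rng.normal(size=(n_new, hidden))])
    up = np.vstack([up, 0.01 * rng.normal(size=(n_new, hidden))])
    # New down columns start at zero: the grown neurons contribute nothing
    # until training updates them.
    down = np.hstack([down, np.zeros((down.shape[0], n_new))])
    # In training, optimizer state would be cleared here so the new
    # parameters get fresh moment estimates.
    return gate, up, down

rng = np.random.default_rng(0)
gate = rng.normal(size=(20, 8))
up = rng.normal(size=(20, 8))
down = rng.normal(size=(8, 20))
gate, up, down = grow_coordinated(gate, up, down, n_new=4, rng=rng)
```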
4.3 Frankenmerge Results
| Metric | Value |
|---|---|
| Initial loss | 6.21 |
| Final loss | 3.67 |
| Neurons grown | 24 |
| Training time | 57.6 min |
Despite the 41% loss reduction and successful growth, output quality remained poor: repetitive text, failed identity learning, and incoherent reasoning. The fundamental issue is that hidden state representations learned end-to-end in the original models are incompatible when stacked. Seam-layer fine-tuning smooths local transitions but cannot resolve the global representation mismatch across 56 layers.
5. Fine-tuning the Winner
SLERP was fine-tuned on 3,499 examples (3,049 coding + 450 identity) for 3 epochs:
| Metric | Value |
|---|---|
| Base model | SLERP merge (1.5B) |
| Learning rate | 2e-5, cosine decay |
| Optimizer | 8-bit Adam |
| Final loss | 1.54 |
| Training time | 53 min |
| VRAM usage | 8.0 GB |
The fine-tuned model (ATLES-1.5B) correctly identifies itself, produces structured code with docstrings, explains concepts clearly, and maintains conversational coherence.
6. Key Findings
SLERP > Frankenmerge for same-architecture models. Weight interpolation preserves within-layer representations. Layer stacking breaks end-to-end learned representations regardless of fine-tuning.
MoE works without training but has artifacts. The router enables capability selection, but untrained routing produces text artifacts. MoE's 2.5x size penalty may not justify the marginal benefit over SLERP.
CPU offloading enables >2B training on 12GB. Using accelerate's `device_map="auto"` with memory limits, frozen layers reside in RAM while trainable layers occupy GPU. This simple technique is underutilized for fine-tuning experiments on consumer hardware.

MLP growth requires coordinated projection management. In gated MLP architectures (gate/up/down), the three projections are dimensionally coupled. Growth systems must treat the triplet as a unit to prevent shape mismatches.
Seam-layer fine-tuning is insufficient for Frankenmerge. Training 17% of parameters at transition points cannot compensate for incompatible hidden representations across the full model depth. Full fine-tuning or pre-merge representation alignment may be necessary.
Small model merges are less forgiving. At 1.5B parameters, there is less redundancy to absorb merge artifacts compared to 7B+ models. Technique selection matters more.
7. Practical Recommendations
For practitioners merging small (<3B) models:
- Start with SLERP. It is simple, preserves model size, and works well out of the box.
- Use layer-wise gradients to control which model dominates in attention vs MLP.
- Fine-tune after merging: even a single epoch on task-relevant data significantly improves quality.
- Skip Frankenmerge unless you have compute for full fine-tuning.
- Consider MoE only if inference cost is acceptable and you plan to train the router.
8. Reproducibility
All code, configs, and the final model are publicly available:
- Model: spartan8806/ATLES-1.5B
- Merge tool: mergekit
- Hardware: Single NVIDIA RTX 3060 12GB
- Training time: ~2 hours total (all experiments)
References
- Goddard, C. et al. (2024). "Arcee's MergeKit: A Toolkit for Merging Large Language Models." arXiv:2403.13257
- Shoemake, K. (1985). "Animating rotation with quaternion curves." SIGGRAPH.
- Yadav, P. et al. (2023). "Resolving Interference When Merging Models." NeurIPS.
- Muqeeth, M. et al. (2024). "Soft Merging of Experts with Adaptive Router." ICML.