Practical Model Merging at 1.5B: Comparing Frankenmerge, SLERP, and Mixture of Experts for Small Language Models

Connor Spartan | February 2026

Model: spartan8806/ATLES-1.5B | Growth Paper: spartan8806/Neural-Foam-Growth


Abstract

We compare three model merging approaches, Frankenmerge (layer stacking), SLERP (spherical linear interpolation), and Mixture of Experts (MoE), applied to two Qwen2.5-1.5B variants: a reasoning-tuned model (74% ARC-Easy) and Qwen2.5-Coder-1.5B-Instruct. We find that SLERP produces the best results at the original 1.5B parameter count, MoE works out of the box but increases total parameters to 2.5x the original, and Frankenmerge fails even after targeted fine-tuning with a novel coordinated Neural Foam Growth technique. We also demonstrate that CPU offloading enables training a 2.85B-parameter model on a single RTX 3060 (12GB), and that coordinated growth across the MLP projection triplet (gate/up/down) is necessary to prevent dimension mismatches during neuron growth. The resulting ATLES-1.5B model achieves strong conversational and coding ability at minimal inference cost.

1. Introduction

Model merging has emerged as a zero-compute technique for combining capabilities from multiple fine-tuned models without additional training. While extensively studied at 7B+ parameter counts, practical guidance for merging small (1-2B) language models remains limited. Small models are particularly interesting because they can run on consumer hardware, making merge quality critical: there is less redundancy to absorb merge artifacts.

We merge two Qwen2.5-1.5B variants:

  • Chimera V3: Custom reasoning model achieving 74% ARC-Easy through Neural Foam Growth training (5 epochs, loss 0.89 to 0.17)
  • Qwen2.5-Coder-1.5B-Instruct: Alibaba's code-specialized model

Both share identical architecture (28 layers, 1536 hidden dim, 12 attention heads), making them ideal merge candidates.

2. Methods

2.1 Frankenmerge (Passthrough)

Layer stacking in a sandwich configuration:

  • Layers 0-13: Chimera V3 (reasoning)
  • Layers 14-41: Qwen-Coder (coding)
  • Layers 42-55: Chimera V3 (reasoning)

Total: 56 layers, 2.85B parameters. The hypothesis is that placing reasoning layers at the input and output, with coding layers in the middle, yields a model that reasons about code.

Config:

slices:
  - sources:
    - model: chimera-v3-hf
      layer_range: [0, 14]
  - sources:
    - model: Qwen/Qwen2.5-Coder-1.5B-Instruct
      layer_range: [0, 28]
  - sources:
    - model: chimera-v3-hf
      layer_range: [14, 28]
merge_method: passthrough
dtype: bfloat16

2.2 SLERP (Spherical Linear Interpolation)

Weight-level interpolation with layer-wise gradients: attention layers favor Chimera V3 (t=0.3-0.6) to preserve reasoning, while MLP layers favor Qwen-Coder (t=0.4-0.7) to retain coding capability. Here t is the interpolation fraction toward Qwen-Coder, so lower values keep more of Chimera V3.

The key insight: SLERP interpolates along the hypersphere surface rather than linearly, better preserving weight magnitude and learned representations.
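As a concrete illustration, here is a minimal sketch of the interpolation itself as a standalone PyTorch helper (our own simplification, not mergekit's implementation), applied to a pair of same-shape weight tensors:

import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    # Spherical interpolation between two same-shape weight tensors:
    # t = 0 returns w_a, t = 1 returns w_b.
    a, b = w_a.flatten().float(), w_b.flatten().float()
    dot = torch.clamp(torch.dot(a / (a.norm() + eps), b / (b.norm() + eps)), -1.0, 1.0)
    omega = torch.acos(dot)            # angle between the two weight vectors
    if omega.abs() < 1e-4:             # nearly colinear: plain lerp is numerically safer
        merged = (1 - t) * a + t * b
    else:
        merged = (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)
    return merged.reshape(w_a.shape).to(w_a.dtype)

In the actual merge, mergekit applies the interpolation per tensor, with t varying by layer according to the config below.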

Config:

models:
  - model: chimera-v3-hf
  - model: Qwen/Qwen2.5-Coder-1.5B-Instruct
merge_method: slerp
base_model: chimera-v3-hf  # slerp requires a base model; t interpolates away from it
parameters:
  t:
    - filter: self_attn
      value: [0.3, 0.4, 0.5, 0.6]
    - filter: mlp
      value: [0.4, 0.5, 0.6, 0.7]
    - value: 0.5
dtype: float16

2.3 Mixture of Experts (MoE)

Two routed experts (Chimera V3 + Qwen-Coder) with a shared expert (base Qwen2.5-1.5B-Instruct). A learned router selects 1 expert per token based on hidden states.

Total: 3.86B parameters (2.5x original), though only ~1.5B active per token during inference.
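A minimal sketch of this routing scheme follows; the module names and the top-1 dispatch loop are illustrative and do not reflect the merged checkpoint's actual module layout:

import torch
import torch.nn as nn

class TwoExpertMoEMLP(nn.Module):
    # Top-1 routing over two expert MLPs plus an always-active shared expert.
    def __init__(self, hidden_dim: int, expert_reasoning: nn.Module,
                 expert_coder: nn.Module, shared_expert: nn.Module):
        super().__init__()
        self.router = nn.Linear(hidden_dim, 2, bias=False)   # learned per-token gate
        self.experts = nn.ModuleList([expert_reasoning, expert_coder])
        self.shared = shared_expert

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: [num_tokens, hidden_dim]
        weight, idx = torch.softmax(self.router(x), dim=-1).max(dim=-1)
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):             # send each token to its top expert
            mask = idx == e
            if mask.any():
                routed[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
        return self.shared(x) + routed                        # shared expert always contributes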

2.4 Pre-merge Preparation

Chimera V3 was trained with Neural Foam Growth, which grew 16 additional neurons in the intermediate MLP dimension (8960 to 8976). Before merging, we trimmed these back to standard dimensions by removing the lowest-activation neurons (0.18% parameter loss, negligible impact on performance).
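A minimal sketch of the trimming step, assuming per-neuron activation statistics (act_scores, e.g. mean absolute activation over a small calibration set) have already been collected:

import torch

def trim_intermediate(mlp, target_dim: int, act_scores: torch.Tensor):
    # Keep the target_dim highest-activation intermediate neurons so the
    # gate/up/down projections return to the standard width (here 8976 -> 8960).
    keep = torch.topk(act_scores, k=target_dim).indices.sort().values
    mlp.gate_proj.weight.data = mlp.gate_proj.weight.data[keep].contiguous()      # rows = neurons
    mlp.up_proj.weight.data = mlp.up_proj.weight.data[keep].contiguous()
    mlp.down_proj.weight.data = mlp.down_proj.weight.data[:, keep].contiguous()   # columns = neurons
    mlp.gate_proj.out_features = mlp.up_proj.out_features = target_dim
    mlp.down_proj.in_features = target_dim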

3. Initial Results

All three merges were evaluated on an 8-question test suite covering identity, coding (2), reasoning (2), debugging, conversation, and instruction following, scored 0-3 per question (max 24).

Model        | Params | Score        | Inference Speed
SLERP        | 1.5B   | 24/24 (100%) | 51.3 tok/s
MoE          | 3.86B  | 24/24 (100%) | 30.0 tok/s
Frankenmerge | 2.85B  | 22/24 (92%)  | 27.3 tok/s

SLERP and MoE performed well immediately. Frankenmerge showed degraded output quality, motivating the fine-tuning experiments in Section 4.

3.1 Qualitative Analysis

Despite similar quantitative scores, qualitative differences were significant:

SLERP produced coherent, well-structured responses with smooth transitions. The layer-wise gradient preserved distinct capabilities while maintaining internal consistency.

MoE generated reasonable responses but exhibited "aggregator" text artifacts and occasional Chinese character leakage from the Qwen base, suggesting the router's expert selection was not always optimal without further training.

Frankenmerge produced syntactically valid but semantically degraded output: repetitive phrases, logical inconsistencies, and poor instruction following.

4. Frankenmerge Recovery Experiments

4.1 Challenge: Training 2.85B on 12GB

The 56-layer Frankenmerge at 2.85B parameters posed a memory challenge for our RTX 3060 (12GB VRAM):

Component                 | Memory (bf16)
Model weights             | 5.7 GB
Gradients (81% trainable) | 4.6 GB
8-bit optimizer states    | 2.3 GB
Total                     | 12.6 GB (exceeds 12 GB)

Solution: Seam-layer training with CPU offloading. We identified the 12 most critical layers, the "seams" where Chimera V3 meets Qwen-Coder in the sandwich:

  • Layers 0-1: Input adaptation
  • Layers 12-15: First transition (reasoning to coding)
  • Layers 40-43: Second transition (coding to reasoning)
  • Layers 54-55: Output adaptation

Using device_map="auto" with max_memory={0: "9GiB", "cpu": "24GiB"}, frozen layers resided in CPU RAM while only the 12 trainable layers' MLP projections occupied GPU memory. This reduced trainable parameters from 81% to 17.4%, fitting comfortably in VRAM at 7.0-7.5GB.
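A minimal sketch of the loading and freezing logic, using standard transformers/accelerate arguments (the checkpoint path is hypothetical; the layer indices are the seams listed above):

import torch
from transformers import AutoModelForCausalLM

SEAM_LAYERS = {0, 1, 12, 13, 14, 15, 40, 41, 42, 43, 54, 55}
model = AutoModelForCausalLM.from_pretrained(
    "./frankenmerge-2.85b",                      # hypothetical local path to the 56-layer merge
    torch_dtype=torch.bfloat16,
    device_map="auto",                           # accelerate places layers across GPU and CPU
    max_memory={0: "9GiB", "cpu": "24GiB"},      # cap the GPU so frozen layers spill to RAM
)
for p in model.parameters():                     # freeze everything...
    p.requires_grad = False
for idx, layer in enumerate(model.model.layers): # ...then unfreeze only the seam layers' MLPs
    if idx in SEAM_LAYERS:
        for p in layer.mlp.parameters():
            p.requires_grad = True
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.1%}")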

4.2 Coordinated Neural Foam Growth

We applied Neural Foam Growth (GrowableLinearV2) to the seam layers, enabling organic neuron addition during training. A critical discovery: independent growth of gate_proj and up_proj causes dimension mismatches.

In the Qwen MLP architecture:

output = down_proj(act_fn(gate_proj(x)) * up_proj(x))

The element-wise multiplication requires gate_proj and up_proj to have identical output dimensions. Independent growth produced mismatches (e.g., 8964 vs 8962), causing runtime errors.

Solution: Coordinated growth protocol. gate_proj serves as the growth leader. When it triggers growth of N neurons:

  1. gate_proj output dimension grows by N
  2. up_proj output dimension grows by N (same count)
  3. down_proj input dimension grows by N (column expansion)
  4. Optimizer state is cleared to reinitialize for new parameters

This protocol maintained dimensional consistency while preserving the growth mechanism. Over 3 epochs (9,447 steps, 57.6 minutes), 24 neurons grew across the seam layers.
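A minimal sketch of one coordinated growth step; the helpers below are illustrative stand-ins, not the GrowableLinearV2 internals:

import torch
import torch.nn as nn

def grow_out(linear: nn.Linear, n: int) -> nn.Linear:
    # Append n output neurons (new rows initialized near zero).
    new = nn.Linear(linear.in_features, linear.out_features + n, bias=False)
    with torch.no_grad():
        new.weight[: linear.out_features] = linear.weight
        new.weight[linear.out_features:] = 0.01 * torch.randn(n, linear.in_features)
    return new.to(linear.weight.device, linear.weight.dtype)

def grow_in(linear: nn.Linear, n: int) -> nn.Linear:
    # Append n input columns initialized to zero, so new neurons start with no influence.
    new = nn.Linear(linear.in_features + n, linear.out_features, bias=False)
    with torch.no_grad():
        new.weight[:, : linear.in_features] = linear.weight
        new.weight[:, linear.in_features:] = 0.0
    return new.to(linear.weight.device, linear.weight.dtype)

def grow_mlp_coordinated(mlp, n: int, optimizer):
    # Grow the gate/up/down triplet together so gate_proj(x) * up_proj(x) stays aligned.
    mlp.gate_proj = grow_out(mlp.gate_proj, n)   # step 1: growth leader
    mlp.up_proj = grow_out(mlp.up_proj, n)       # step 2: same count as the leader
    mlp.down_proj = grow_in(mlp.down_proj, n)    # step 3: accept the widened product
    optimizer.state.clear()                      # step 4 (in practice, rebuild the optimizer over the new parameters)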

4.3 Frankenmerge Results

Metric         | Value
Initial loss   | 6.21
Final loss     | 3.67
Neurons grown  | 24
Training time  | 57.6 min

Despite a 41% loss reduction and successful growth, output quality remained poor: repetitive text, failed identity learning, and incoherent reasoning. The fundamental issue: hidden state representations learned end-to-end in the original models are incompatible when stacked. Seam-layer fine-tuning smooths local transitions but cannot resolve the global representation mismatch across 56 layers.

5. Fine-tuning the Winner

SLERP was fine-tuned on 3,499 examples (3,049 coding + 450 identity) for 3 epochs:

Metric         | Value
Base model     | SLERP merge (1.5B)
Learning rate  | 2e-5, cosine decay
Optimizer      | 8-bit Adam
Final loss     | 1.54
Training time  | 53 min
VRAM usage     | 8.0 GB

The fine-tuned model (ATLES-1.5B) correctly identifies itself, produces structured code with docstrings, explains concepts clearly, and maintains conversational coherence.
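A minimal sketch of the training configuration, assuming the Hugging Face Trainer (dataset preparation and the Trainer call are omitted; the output path and batch settings are assumptions, the remaining values come from the table above):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="atles-1.5b-sft",        # assumed output path
    num_train_epochs=3,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",         # cosine decay
    optim="adamw_bnb_8bit",             # 8-bit Adam via bitsandbytes
    bf16=True,                          # assumption: bf16 training on the RTX 3060
    per_device_train_batch_size=1,      # assumption: small batch to stay under 12 GB
    gradient_accumulation_steps=8,      # assumption
    logging_steps=50,
)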

6. Key Findings

  1. SLERP > Frankenmerge for same-architecture models. Weight interpolation preserves within-layer representations. Layer stacking breaks end-to-end learned representations regardless of fine-tuning.

  2. MoE works without training but has artifacts. The router enables capability selection, but untrained routing produces text artifacts. MoE's 2.5x size penalty may not justify the marginal benefit over SLERP.

  3. CPU offloading enables >2B training on 12GB. Using accelerate's device_map="auto" with memory limits, frozen layers reside in RAM while trainable layers occupy GPU. This simple technique is underutilized for fine-tuning experiments on consumer hardware.

  4. MLP growth requires coordinated projection management. In gated MLP architectures (gate/up/down), the three projections are dimensionally coupled. Growth systems must treat the triplet as a unit to prevent shape mismatches.

  5. Seam-layer fine-tuning is insufficient for Frankenmerge. Training 17% of parameters at transition points cannot compensate for incompatible hidden representations across the full model depth. Full fine-tuning or pre-merge representation alignment may be necessary.

  6. Small model merges are less forgiving. At 1.5B parameters, there is less redundancy to absorb merge artifacts compared to 7B+ models. Technique selection matters more.

7. Practical Recommendations

For practitioners merging small (<3B) models:

  • Start with SLERP. It is simple, preserves model size, and works well out of the box.
  • Use layer-wise gradients to control which model dominates in attention vs MLP.
  • Fine-tune after merging: even a single epoch on task-relevant data significantly improves quality.
  • Skip Frankenmerge unless you have compute for full fine-tuning.
  • Consider MoE only if inference cost is acceptable and you plan to train the router.

8. Reproducibility

All code, configs, and the final model are publicly available through the repositories listed in the header (spartan8806/ATLES-1.5B and spartan8806/Neural-Foam-Growth).

References

  1. Goddard, C. et al. (2024). "Arcee's MergeKit: A Toolkit for Merging Large Language Models." arXiv:2403.13257
  2. Shoemake, K. (1985). "Animating rotation with quaternion curves." SIGGRAPH.
  3. Yadav, P. et al. (2023). "Resolving Interference When Merging Models." NeurIPS.
  4. Muqeeth, M. et al. (2024). "Soft Merging of Experts with Adaptive Router." ICML.