Dual-Teacher Distillation and Strategic SLERP Merging

Community Article Published April 17, 2026

In this work, we explore a complementary approach: rather than distilling from a single teacher model, we investigate whether distillation from multiple specialized teachers followed by strategic model merging can produce a student model that leverages the complementary strengths of each teacher. Specifically, we make the following contributions:

  1. Dual-Teacher Distillation Datasets: We create two high-quality reasoning distillation datasets using different teacher models—Kimi-2.5-thinking and Qwen3.6-plus—covering complementary domains such as coding, mathematics, science, finance, and strategic planning.

  2. QLoRA-Based Fine-Tuning Pipeline: We fine-tune Qwen3-4B-Thinking-2507 using QLoRA on each dataset, producing two specialized reasoning models with distinct capability profiles.

  3. Custom SLERP Merging Strategy (“Golden Path”): We develop a novel SLERP merge configuration that addresses the well-documented catastrophic forgetting problem in model merging by pinning vocabulary layers and applying smooth gradient interpolation across transformer layers.

  4. Multi-Domain Benchmark (CMDR-Bench): We design and release a comprehensive benchmark with 100 carefully curated test cases across 10 cognitive domains with graduated difficulty levels.

  5. Empirical Analysis: We provide a thorough comparative analysis demonstrating that the merged model achieves synergistic improvements—a “1+1=3 effect”—in logical reasoning and planning tasks, exceeding the performance of both individual distilled models and the base model.

Knowledge Distillation for Small Language Models

Knowledge distillation (KD) [3] has a rich history in deep learning, originally proposed by Hinton et al. (2015) for model compression. In the LLM era, the paradigm has evolved significantly. Traditional KD methods focused on matching output logits or intermediate activations between teacher and student models. However, the current dominant approach for reasoning distillation is output-based distillation using reasoning traces: generating training data consisting of question-answer pairs with full chain-of-thought (CoT) reasoning traces from a capable teacher model, then fine-tuning a smaller student model on this data.
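As a concrete sketch of this recipe, one training record could pair the question with the teacher's full trace wrapped in the `<think>` tags used by Qwen's Thinking chat format. The record schema below is an illustrative assumption, not the authors' released dataset format:

```python
def to_training_example(question: str, teacher_trace: str, teacher_answer: str) -> dict:
    """Pack one teacher generation into a chat-style fine-tuning record.

    The reasoning trace is kept inside <think> tags so the student
    (a Qwen "Thinking" model) learns both the trace and the final answer.
    """
    return {
        "messages": [
            {"role": "user", "content": question},
            {
                "role": "assistant",
                "content": f"<think>\n{teacher_trace}\n</think>\n\n{teacher_answer}",
            },
        ]
    }

example = to_training_example(
    "What is 17 * 24?",
    "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    "408",
)
```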

Our work builds on this foundation by exploring dual-teacher distillation, where two different teacher models (Kimi-2.5-thinking and Qwen3.6-plus) contribute complementary reasoning patterns to the training data, followed by model merging to combine the acquired capabilities.

Model Merging

Model merging has rapidly evolved as a practical approach to combine capabilities from multiple fine-tuned models without additional training. The field encompasses several techniques:

• SLERP (Spherical Linear Interpolation) [7]: Originally developed for smooth quaternion rotation in computer graphics, SLERP normalizes weight vectors onto a unit hypersphere and interpolates along the geodesic arc, preserving both direction and magnitude.

This prevents the “weight collapse” that occurs with linear interpolation when vectors point in different directions. The mathematical formulation is:

SLERP(p0, p1, t) = [sin((1 − t)Ω) / sin(Ω)] · p0 + [sin(tΩ) / sin(Ω)] · p1    (1)

where Ω is the angle between vectors p0 and p1, and t ∈ [0, 1] is the interpolation parameter.
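A direct NumPy sketch of this formula, applied to flattened weight tensors, might look like the following (the linear fallback for nearly parallel vectors is a standard numerical safeguard, not something stated in the text):

```python
import numpy as np

def slerp(p0: np.ndarray, p1: np.ndarray, t: float, eps: float = 1e-8) -> np.ndarray:
    """Spherical linear interpolation between two weight tensors."""
    # Angle between the two (flattened, normalized) weight vectors
    v0 = p0.ravel() / (np.linalg.norm(p0) + eps)
    v1 = p1.ravel() / (np.linalg.norm(p1) + eps)
    omega = np.arccos(np.clip(v0 @ v1, -1.0, 1.0))
    if omega < eps:
        # Nearly parallel vectors: SLERP degenerates to plain LERP
        return (1.0 - t) * p0 + t * p1
    s = np.sin(omega)
    return (np.sin((1.0 - t) * omega) / s) * p0 + (np.sin(t * omega) / s) * p1

# Midpoint of two orthogonal unit vectors stays on the unit sphere,
# illustrating how SLERP preserves magnitude where LERP would shrink it.
mid = slerp(np.array([1.0, 0.0]), np.array([0.0, 1.0]), 0.5)
```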

A critical challenge in model merging is catastrophic forgetting [11, 12]: the merged model often performs worse than either source model due to parameter interference. Yadav et al. [8] demonstrated that constructive interference is not guaranteed and depends heavily on method selection and model compatibility. Furthermore, research on reversible merging for low-rank weights [13] has shown that conventional merging methods can cause severe degradation when applied to QLoRA/LoRA weights, which is directly relevant to our distillation pipeline. Our “Golden Path” strategy addresses these challenges through targeted layer pinning and gradient interpolation.

Evaluation Benchmarks

Our CMDR-Bench contributes to this landscape by providing a focused evaluation of reasoning capabilities across 10 domains with graduated difficulty, specifically designed to discriminate between models in the 4B parameter range where capability differences are nuanced and require careful assessment.

The 10 domains were selected to provide comprehensive coverage of the reasoning capabilities most relevant to real-world applications of language models:

  1. Logical Reasoning (Text-Based): Tests deductive and inductive reasoning from natural language premises, including syllogisms, conditional reasoning, and logical puzzles.
  2. Mathematical Reasoning: Covers arithmetic, algebra, geometry, combinatorics, and number theory problems at graduated difficulty levels.
  3. SQL Query Generation: Evaluates the ability to translate natural language requirements into correct SQL queries given schema and metadata, testing both syntactic accuracy and semantic understanding.
  4. Python Code Analysis and Debugging: Presents code snippets with bugs, performance issues, or logical errors, requiring models to identify and fix problems.
  5. Scientific Explanation and Hypothesis Evaluation (RAG): Tests the model’s ability to evaluate scientific hypotheses and provide explanations, potentially requiring retrieval-augmented generation capabilities.
  6. Complex Scenario Analysis and Conclusion Derivation: Presents multi-faceted scenarios requiring synthesis of multiple constraints and variables to reach correct conclusions.
  7. Ethical Dilemma Evaluation: Tests nuanced reasoning about ethical trade-offs, requiring models to identify competing values and construct well-reasoned positions.
  8. Causal Reasoning in Historical Events (RAG): Evaluates the ability to identify causal chains in historical events, requiring both factual knowledge and analytical reasoning.
  9. Planning and Optimization: Requires constructing multi-step plans under explicit constraints, combining analytical decomposition with structured optimization.
  10. Constrained Creative Writing: Tests creative text generation under explicit structural and content constraints.

Methodology

Our approach follows a three-stage pipeline: (1) Dataset Construction—generating high-quality reasoning trace datasets from two distinct teacher models; (2) Distillation Fine-Tuning—training specialized reasoning variants of a small base model; and (3) Strategic Merging—combining the distilled models via custom SLERP to achieve synergistic performance. Figure 1 illustrates this pipeline.

Figure 1

Fine-Tuning Pipeline

Dataset 1: Kimi-2.5-High-Reasoning-250x

The first distillation dataset was generated using Kimi-2.5-thinking as the teacher model, which produces detailed reasoning traces and final answers for complex questions. The dataset covers multiple technical, scientific, historical, and strategic domains, with particular emphasis on analytical depth and structured problem decomposition.

Dataset 2: Qwen3.6-Plus-High-Reasoning-500x

The second dataset was prepared using Qwen3.6-plus, covering topics such as coding, mathematics, finance, medicine, and economics. With 500 samples and 1,739,249 total tokens, this dataset is larger and more domain-focused than the Kimi-derived dataset. The Qwen3.6-plus teacher model produces structured, concise reasoning traces with a strong emphasis on mathematical precision and algorithmic thinking.

The complementary nature of these two datasets is by design: the Kimi-derived dataset provides broad analytical coverage with rich exploratory reasoning, while the Qwen-derived dataset offers deeper coverage of quantitative and computational domains with more structured outputs. This complementarity is key to the success of the subsequent merging stage.

Base Model

Both distilled models share the same base architecture: Qwen/Qwen3-4B-Thinking-2507. This thinking model features a 4-billion-parameter transformer architecture with support for extended reasoning via internal chain-of-thought processing. The Thinking variant is particularly suitable as a distillation target because it already possesses the architectural capacity for multi-step reasoning, which can be refined and redirected through targeted fine-tuning.

Model 1: Qwen3-4B-Kimi2.5-Reasoning-Distilled

This model was fine-tuned on the Kimi-2.5-High-Reasoning-250x dataset (250 samples, 1.1M tokens). The distillation successfully transferred Kimi-2.5’s analytical capabilities, producing a model that excels at breaking down complex problems, self-correcting during reasoning, and providing detailed analytical answers. The model demonstrates particular strength in scientific explanation, causal reasoning in historical events, and complex scenario analysis.

Model 2: Qwen3-4B-Qwen3.6-Plus-Reasoning-Distilled

This model was fine-tuned on the Qwen3.6-Plus-High-Reasoning-500x dataset (500 samples, 1.7M tokens). The distillation focused on transferring mathematical precision and structured algorithmic thinking from the larger Qwen3.6-plus model. A key qualitative improvement observed in this model is the transformation of reasoning style: the base model’s stream-of-consciousness, exploratory approach is replaced by structured, professional, report-oriented reasoning that proceeds confidently from problem analysis through algorithm design to complexity analysis. Table 2 summarizes the qualitative differences.

Table 2

SLERP Merging Strategy: The “Golden Path”

Motivation

Standard SLERP merges, while preserving weight norms and angular relationships better than linear interpolation, often suffer from two critical failure modes: (1) RAG/Vocabulary Degradation, where the merged model’s embedding and output layers interfere, causing degraded performance in retrieval-augmented generation and text generation quality; and (2) Catastrophic Forgetting, where the merged model loses capabilities present in both source models due to destructive weight interference in intermediate layers.

Configuration

To address these challenges, we developed the “Golden Path” (V5) SLERP configuration, implemented via MergeKit. The key innovations are:

  1. Vocabulary Pinning: The embed_tokens and lm_head layers are strictly pinned to the Qwen model (t = 1.0). This ensures the merged model reads and generates text using exclusively Qwen’s vocabulary, completely eliminating the RAG degradation problem that arises when vocabulary representations from different fine-tuned models interfere.

  2. Gradient Interpolation: The intermediate attention and MLP layers follow a smooth gradient from full Kimi influence to full Qwen influence: [0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9, 1]. This prevents abrupt weight transitions that cause interference in deep reasoning steps, allowing earlier layers to retain Kimi’s broad analytical patterns while later layers leverage Qwen’s structured mathematical precision.

  3. Base Model Selection: Kimi-distilled model serves as the base model, ensuring the merge preserves Kimi’s broad analytical foundation while selectively incorporating Qwen’s precision.

Table 3 shows the complete YAML configuration used for the merge.

The gradient [0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9, 1] represents the interpolation parameter t (from Equation 1) applied across nine equal partitions of the model’s intermediate layers. At t = 0, the merged weight equals the Kimi model weight; at t = 1, it equals the Qwen model weight. The non-uniform spacing (accelerating from Kimi to Qwen) was determined empirically through multiple iterations, with each variant evaluated on CMDR-Bench to identify the configuration that maximizes synergistic gains.
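MergeKit expands such an anchor list into one t value per layer. Assuming piecewise-linear interpolation across the layer stack and 36 transformer layers for Qwen3-4B (both assumptions on our part), the expansion can be sketched as:

```python
import numpy as np

def expand_gradient(anchors: list[float], num_layers: int) -> np.ndarray:
    """Expand a MergeKit-style gradient list to one t value per layer
    via piecewise-linear interpolation of the anchor points."""
    anchor_pos = np.linspace(0.0, 1.0, len(anchors))
    layer_pos = np.linspace(0.0, 1.0, num_layers)
    return np.interp(layer_pos, anchor_pos, anchors)

golden_path = [0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9, 1]
ts = expand_gradient(golden_path, 36)  # assumed layer count for Qwen3-4B
# ts[0] == 0.0 (pure Kimi weights), ts[-1] == 1.0 (pure Qwen weights),
# with a monotone ramp in between
```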

Table 3: SLERP merge configuration (Golden Path V5).

```yaml
models:
  - model: khazarai/Qwen3-4B-Kimi2.5-Reasoning-Distilled
  - model: khazarai/Qwen3-4B-Qwen3.6-plus-Reasoning-Distilled
merge_method: slerp
base_model: khazarai/Qwen3-4B-Kimi2.5-Reasoning-Distilled
parameters:
  t:
    - filter: embed_tokens
      value: 1.0
    - filter: lm_head
      value: 1.0
    - filter: self
      value: [0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9, 1]
    - value: 1.0
dtype: bfloat16
```

Experiments and Results

We evaluate four models on CMDR-Bench:

  1. Qwen3-4B-Thinking-2507 (Base Model): The original pre-trained model without distillation.
  2. Qwen3-4B-Kimi2.5-Reasoning-Distilled (Kimi-Distilled): Fine-tuned on Kimi-derived dataset.
  3. Qwen3-4B-Qwen3.6-Plus-Reasoning-Distilled (Qwen-Distilled): Fine-tuned on Qwen3.6-plus-derived dataset.
  4. Qwen3-4B-Qwen3.6-Plus-Reasoning-Slerp (Merged Model): SLERP merge of models 2 and 3 using Golden Path configuration.

Scoring Methodology

Each test case is scored on a pass/fail basis, with partial credit awarded for partially correct responses. The graduated difficulty ensures that the benchmark can distinguish between models of varying capability levels, with Level 1–3 problems accessible to most competent models and Level 8–10 problems requiring advanced reasoning capabilities. Domain-level scores are computed as the average across the 10 test cases in that domain.
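With per-case scores in [0, 1], a domain score is simply the mean over that domain's 10 cases, reported as a percentage. A tiny sketch (the example scores are hypothetical):

```python
def domain_score(case_scores: list[float]) -> float:
    """Average per-case scores (pass = 1.0, fail = 0.0, partial credit
    in between) and report the domain score as a percentage."""
    return 100.0 * sum(case_scores) / len(case_scores)

# Hypothetical domain: 8 passes, one half-credit answer, one failure
score = domain_score([1.0] * 8 + [0.5, 0.0])  # → 85.0
```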

Main Results

Table 4 presents the complete benchmark results across all 10 domains and 4 models. Figure 2 provides a visual comparison of the performance profiles.

Key Findings

Synergistic Gains in Target Domains

The most striking result is the emergence of synergistic performance improvements in the merged model that exceed both individual distilled models. Two domains stand out:

Logical Reasoning: The merged model achieves 76.4%, outperforming the Kimi-distilled model (68.2%), the Qwen-distilled model (60.0%), and the base model (60.0%). This represents an 8.2 percentage point improvement over the best distilled model and a 27.3% relative improvement over the base model. We attribute this to the complementary nature of the two distillation sources: Kimi’s broad analytical patterns combined with Qwen’s structured mathematical thinking create a more robust logical reasoning capability.

Planning and Optimization: The merged model achieves 72.7%, representing a dramatic improvement over the Qwen-distilled model (56.4%), the Kimi-distilled model (43.6%), and the base model (38.2%). This 16.3 percentage point gain over the best distilled model is the largest single-domain improvement observed. Planning tasks require both analytical decomposition and structured optimization, precisely the complementary strengths contributed by the two teachers.
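The quoted gains can be recomputed directly from the per-domain success rates (values as reported in Table 4):

```python
# Per-domain success rates (%) for the four evaluated models
logical = {"base": 60.0, "kimi": 68.2, "qwen": 60.0, "merged": 76.4}
planning = {"base": 38.2, "kimi": 43.6, "qwen": 56.4, "merged": 72.7}

# Percentage-point gain over the best individual distilled model
pp_logical = logical["merged"] - max(logical["kimi"], logical["qwen"])
pp_planning = planning["merged"] - max(planning["kimi"], planning["qwen"])
# Relative improvement over the base model
rel_logical = 100.0 * (logical["merged"] - logical["base"]) / logical["base"]

print(round(pp_logical, 1), round(pp_planning, 1), round(rel_logical, 1))  # → 8.2 16.3 27.3
```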

Table 4: CMDR-Bench results: Success rates (%) across 10 cognitive domains for four models. Bold indicates the best score per domain; underline indicates the merged model’s score.

Table 4

Figure 2: Multi-model reasoning performance comparison across 10 benchmark domains. The merged model (green) demonstrates synergistic improvements in Logical Reasoning and Planning & Optimization.

Figure 2

Performance Preservation

The merged model successfully maintains the strengths of both parent models in most domains. It retains perfect scores (100.0%) in Mathematical Reasoning and Scientific Explanation, matching both the base and distilled models. It also preserves the Qwen-distilled model’s strong Python Code Analysis performance (95.5%) without degradation.

Expected Trade-offs

The merged model shows a sharp decline in Constrained Creative Writing (26.4% vs. base model’s 34.5% and Kimi-distilled’s 52.7%). This is an expected and accepted trade-off. The distillation and merging process specifically optimizes for logical, mathematical, and structured reasoning capabilities. The model’s creative writing degradation is a natural consequence of redirecting the model’s capacity toward analytical tasks. We explicitly note that this model is not recommended for creative writing, poetry, or imaginative storytelling use cases.

The Ethical Dilemma domain shows a modest decline (65.5% vs. base model’s 74.5%). This suggests that ethical reasoning, which requires balancing multiple competing values and perspectives, may be partially disrupted by the merge process. The SQL Query Generation domain also shows a slight decline from the base model’s perfect 100.0% to 81.8%, though this matches the Qwen-distilled model and may reflect a shift toward reasoning patterns that prioritize analytical correctness over syntactic efficiency.

RAG Capabilities

Two RAG-dependent domains provide insight into the effectiveness of our vocabulary pinning strategy. Scientific Explanation (RAG) maintains a perfect 100.0% across all models. Causal Reasoning (RAG) shows the merged model at 91.8%, lower than the base model’s 100.0% but higher than the Kimi-distilled model’s 90.9%. The relative preservation of RAG capabilities, compared to the severe degradation typically observed in standard SLERP merges, validates our vocabulary pinning approach.

Ablation: The Importance of the Gradient Configuration

The “Golden Path” configuration was the result of multiple iterations. Earlier configurations that used uniform interpolation (t = 0.5 throughout) or reverse gradients produced inferior results, with the merged model frequently performing worse than both parent models—a classic case of destructive interference. The non-uniform gradient [0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.8, 0.9, 1] was empirically determined to maximize synergistic gains while minimizing destructive interference. The asymmetric acceleration (slower transition from Kimi in early layers, faster adoption of Qwen in later layers) suggests that broad analytical patterns are primarily encoded in lower transformer layers, while structured output formatting is primarily governed by upper layers.

Discussion

The “1+1=3” Effect in Model Merging

Our results demonstrate that under carefully controlled conditions, model merging can produce outcomes that exceed the sum of their parts. The merged model’s average score of 79.1% surpasses the Kimi-distilled model’s 76.8% and the Qwen-distilled model’s 75.7%, with dramatic improvements in specific domains (Logical Reasoning: +8.2pp; Planning: +16.3pp over the best individual distilled model). This “1+1=3” effect is not guaranteed—as noted by Yadav et al. [8] and confirmed by systematic studies [14]—but can be achieved through:

  1. Complementary distillation sources: Using teacher models with different reasoning styles (exploratory vs. structured) ensures that the resulting student models encode complementary capabilities.

  2. Vocabulary preservation: Pinning embedding and output layers prevents the RAG degradation that frequently undermines merged models in practical applications.

  3. Gradient interpolation: Non-uniform layer-wise interpolation respects the functional organization of transformer layers, preserving different types of capabilities in different layers.

Limitations

Several limitations of this work should be acknowledged:

  1. Benchmark Size: CMDR-Bench comprises 100 test cases.
  2. Single Base Architecture: All experiments use Qwen3-4B as the base model.
  3. Qualitative vs Quantitative: Our analysis is primarily quantitative (benchmark scores). Qualitative analysis of reasoning traces, error patterns, and failure modes would provide deeper insight into the mechanisms behind the observed improvements.
  4. Teacher Model Diversity: We use two teacher models (Kimi-2.5-thinking and Qwen3.6-plus). Investigating whether adding more teachers (e.g., DeepSeek-R1, Claude, GPT-4) and merging via multi-model techniques (Task Arithmetic, Model Soup) could further improve results is a promising direction.
  5. Creative Writing Degradation: The sharp decline in creative writing capabilities, while expected, limits the model’s versatility. Future work could explore techniques to preserve creative capabilities during the merge process, such as weighted interpolation or task-vectorbased capability injection.

Conclusion

This article presents a comprehensive investigation into enhancing small language model reasoning through dual-teacher knowledge distillation and strategic SLERP merging. Our three-stage pipeline—constructing complementary distillation datasets from Kimi-2.5-thinking and Qwen3.6-plus, fine-tuning Qwen3-4B via QLoRA, and merging with a custom “Golden Path” SLERP configuration—demonstrates that carefully orchestrated distillation-then-merging can produce compact reasoning models with synergistic capabilities. The key contributions are: (1) two high-quality reasoning distillation datasets covering complementary domains, (2) two specialized distilled models with distinct reasoning profiles, (3) a novel SLERP merging strategy that addresses vocabulary degradation and catastrophic forgetting through layer pinning and gradient interpolation, and (4) a comprehensive 100-case benchmark across 10 cognitive domains. The merged model achieves a “1+1=3” effect in Logical Reasoning (76.4%) and Planning and Optimization (72.7%), outperforming both individual distilled models and the base model. Our findings suggest that the future of efficient reasoning models lies not just in scaling up or training from scratch, but in intelligently combining specialized capabilities through distillation and merging. The “Golden Path” strategy provides a practical, reproducible framework for achieving this, and we release all models, datasets, and the benchmark to facilitate further research in this direction.

References

[1] Daya Guo, Qihao Zhu, Dejian Yang, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[2] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36, 2023.

[3] Canwen Xu, Yifan Sun, et al. A survey on knowledge distillation of large language models. arXiv preprint arXiv:2402.13116, 2024.

[4] Liunian Harold Li, Jack Hessel, Youngjae Yu, et al. Symbolic chain-of-thought distillation: Small models can also “think” step-by-step. arXiv preprint arXiv:2306.14050, 2023.

[5] Xinyu Wang et al. Chain-of-thought curriculum distillation. Proceedings of the ACM, 2025.

[6] Edward J Hu, Yelong Shen, Phillip Wallis, et al. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2022.

[7] Ken Shoemake. Animating rotation with quaternion curves. ACM SIGGRAPH Computer Graphics, 19(3):245–254, 1985.

[8] Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models. Advances in Neural Information Processing Systems, 36, 2023.

[9] Arcee AI. Mergekit: A toolkit for merging large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (Industry Track), 2024.

[10] Enneng Yang et al. Model merging in large language models: Methods, theories, and applications. arXiv preprint arXiv:2408.07666, 2024.

[11] Anton Alexandrov, Veselin Raychev, et al. Mitigating catastrophic forgetting in language transfer via model merging. Findings of EMNLP, 2024.

[12] Multiple. Spurious forgetting in continual learning of language models. NeurIPS, 2024.

[13] Multiple. Towards reversible model merging for low-rank weights. arXiv preprint arXiv:2510.14163, 2025.

[14] Multiple. A systematic study of model merging techniques in large llms. arXiv preprint arXiv:2511.21437, 2025.
