---
base_model: TeichAI/Qwen3.5-4B-Claude-Opus-Reasoning
tags:
- llama.cpp
- gguf
- unsloth
- qwen3_5
- reasoning
- distillation
- claude-opus
- tool-use
license: apache-2.0
language:
- en
datasets:
- TeichAI/Claude-Opus-4.6-Reasoning-887x
- TeichAI/Claude-Sonnet-4.6-Reasoning-799x
- TeichAI/claude-4.5-opus-high-reasoning-250x
- Crownelius/Opus-4.6-Reasoning-2100x-formatted
---

# Qwen3.5 4B – Claude Opus Reasoning Distillation

> **A careful approach to distillation: Premium reasoning capabilities transferred in a single epoch with minimal capability loss.**

![General Benchmark Comparison Chart](benchmarks/all.png)

Before you dismiss this as yet another community distillation with the usual quality tradeoffs, **stop and read this.**

This model takes a more careful approach to distillation. We've transferred Claude Opus 4.6's reasoning patterns and conversational style into Qwen3.5-4B while **avoiding the catastrophic forgetting** that plagues many community distillation attempts. The result: net improvements across most benchmarks with only minor tradeoffs.

---

## 🎯 Why This Model is Different

### The Distillation Problem Everyone Ignores

Most community distillations follow a predictable pattern:
1. Collect synthetic data from a frontier model
2. Train for multiple epochs until loss looks good
3. Ship it and hope for the best

The result? Models that *feel* different but perform *worse*. They lose capabilities on benchmarks, develop repetition issues, forget how to follow instructions properly, perform noticeably worse on coding & math tasks, and exhibit the telltale signs of overfitting that make them unreliable for real-world use.

**We took a completely different approach.**

### The Single-Epoch Revolution

Our methodology proves that **quality dramatically outweighs quantity** in distillation:

| Aspect | Typical Community Distills | Our Approach |
|--------|---------------------------|--------------|
| **Epochs** | 2-4 epochs | **1 epoch** |
| **Data Quality** | Mass-generated synthetic | **Hand-curated Opus reasoning traces** |
| **Capability Retention** | Significant regressions | **Mostly preserved with net gains** |
| **Overfitting** | Common | **None observed** |
| **Output Quality** | Degraded task completion | **Clean, purposeful generation** |

By training for exactly one epoch on curated data, we achieve style transfer while minimizing damage to the model's foundational capabilities. Most of the base model's knowledge remains intact while gaining reasoning patterns from Claude Opus.
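As a rough illustration, a single-epoch run of this kind can be expressed as a TRL `SFTConfig` fragment. Every hyperparameter value below is an assumption for illustration only; the actual recipe for this model is not published.

```python
# Illustrative config fragment only. The one non-negotiable setting for the
# approach described above is num_train_epochs=1; everything else is assumed.
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="qwen3.5-4b-opus-distill",  # assumed
    num_train_epochs=1,                    # the key choice: exactly one pass
    per_device_train_batch_size=2,         # assumed
    gradient_accumulation_steps=8,         # assumed
    learning_rate=2e-5,                    # assumed
    lr_scheduler_type="cosine",            # assumed
    warmup_ratio=0.03,                     # assumed
)
```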

---

## 🧠 What Makes the Training Data Special

### Premium Reasoning from Claude Opus 4.6

This isn't data scraped from random API calls or generated with lazy prompting. Almost every training example comes from **Claude Opus 4.6**, Anthropic's most capable reasoning model, executing complex, multi-step reasoning tasks. To strengthen the corpus, another ~800 examples from **Claude Sonnet 4.6** were added.

The dataset includes:
- **Deep analytical reasoning** with explicit thinking traces
- **Multi-turn conversations** that maintain coherent context
- **Complex problem decomposition** showing how to break down difficult problems
- **Self-correction patterns** where the model catches and fixes its own mistakes

### Mixed Tool + Non-Tool Corpus

Our training corpus intentionally includes:
- **~92% pure reasoning examples**: analytical thinking, problem-solving, explanations
- **~8% tool-use examples**: web search, data fetching, structured operations

This ratio mirrors realistic assistant usage patterns and ensures the model:
1. Doesn't over-index on tool calling when it's unnecessary
2. Knows *when* and *how* to invoke tools appropriately
3. Maintains strong reasoning even when tools are available but not needed
4. Keeps all code-related post-training intact

Tools included: `web_search`, `web_fetch`, `grep`
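Qwen-family chat templates typically emit tool calls as a JSON object wrapped in `<tool_call>` tags. The exact schema below is an assumption based on that convention (check this model's chat template before relying on it), but a minimal parser for such output looks like this:

```python
import json
import re

# Hypothetical assistant reply in the assumed Qwen-style tool-call format.
reply = (
    "I'll look that up.\n"
    '<tool_call>{"name": "web_search", '
    '"arguments": {"query": "Qwen3.5 release notes"}}</tool_call>'
)

def extract_tool_calls(text: str) -> list[dict]:
    """Pull every <tool_call>...</tool_call> JSON payload out of a reply."""
    pattern = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
    return [json.loads(payload) for payload in pattern.findall(text)]

calls = extract_tool_calls(reply)
print(calls[0]["name"])  # web_search
```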

---

## 📊 Benchmark Results

Head-to-head against the base `unsloth/Qwen3.5-4B`:

| Benchmark | Base | Fine-tuned | Ξ” | Result |
|-----------|------|------------|-------|--------|
| **ifeval** | 0.262 | **0.309** | **+17.6%** | ✅ Win |
| **arc_challenge** | 0.346 | **0.392** | **+13.3%** | ✅ Win |
| **winogrande** | 0.589 | **0.638** | **+8.3%** | ✅ Win |
| **hellaswag** | 0.496 | **0.500** | +0.9% | ✅ Win |
| **gpqa_diamond** | 0.283 | 0.283 | 0% | ➖ Tie |
| **truthfulqa_mc2** | **0.545** | 0.530 | -2.7% | ❌ Loss |
| **mmlu** | **0.256** | 0.232 | -9.6% | ❌ Loss |

**Summary: 4 wins, 2 losses, 1 tie.**

![MMLU Subject Breakdown](benchmarks/mmlu.png)

### What This Means

- **Reasoning & instruction following improved**: IFEval (+17.6%), ARC (+13.3%), and Winogrande (+8.3%) gains show better logical reasoning and instruction adherence
- **Knowledge tradeoff on MMLU**: the -9.6% drop suggests some factual recall displacement (common in style transfers)
- **TruthfulQA mostly preserved**: only a -2.7% loss, indicating the model didn't pick up hallucination tendencies

### Qualitative Improvements

- **Reduced token generation**: more concise outputs without verbose padding
- **Fixed thinking loops**: the base model's tendency to get stuck in reasoning cycles is reduced
- **Deeper reasoning traces**: `<think>` blocks show more structured analytical depth
- **Better conversational flow**: responses feel more natural and contextually aware

---

## 🔬 Technical Details

### Key Methodological Choices

1. **Response-only training**: loss computed only on assistant outputs, not user inputs
2. **Preserved reasoning traces**: `<think>` blocks kept intact for reasoning-style transfer
3. **Strict data validation**: malformed traces, duplicates, and broken tool calls removed
4. **Consistent formatting**: unified chat template across all sources
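The response-only choice in point 1 amounts to label masking: tokens outside assistant turns get the cross-entropy ignore index, so no gradient flows from user or system text. A simplified sketch with made-up role spans (not the actual training code):

```python
IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss skips targets with this value

def mask_non_assistant(token_ids: list[int], roles: list[str]) -> list[int]:
    """Build training labels: copy assistant tokens, mask everything else.

    token_ids: the whole chat-formatted sequence
    roles:     parallel list tagging each token "system", "user", or "assistant"
    """
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]

# Toy sequence: 3 user tokens followed by 4 assistant tokens (IDs are made up).
tokens = [101, 2054, 2003, 7592, 2088, 1012, 102]
roles = ["user"] * 3 + ["assistant"] * 4

labels = mask_non_assistant(tokens, roles)
print(labels)  # [-100, -100, -100, 7592, 2088, 1012, 102]
```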

### 📦 Dataset Composition

| Source | Examples | Type |
|--------|----------|------|
| [TeichAI/Claude-Opus-4.6-Reasoning-887x](https://huggingface.co/datasets/TeichAI/Claude-Opus-4.6-Reasoning-887x) | 887 | Mixed |
| [TeichAI/Claude-Sonnet-4.6-Reasoning-799x](https://huggingface.co/datasets/TeichAI/Claude-Sonnet-4.6-Reasoning-799x) | 799 | Pure reasoning |
| [TeichAI/claude-4.5-opus-high-reasoning-250x](https://huggingface.co/datasets/TeichAI/claude-4.5-opus-high-reasoning-250x) | 250 | High complexity |
| [Crownelius/Opus-4.6-Reasoning-2100x-formatted](https://huggingface.co/datasets/Crownelius/Opus-4.6-Reasoning-2100x-formatted) | 2100 | Pure reasoning |
| **Total** | **~4000** | Mixed tool/non-tool |

---

## 💡 Lessons Learned

### What Worked

1. **Single epoch training**: avoided the overfitting that causes catastrophic forgetting in multi-epoch runs
2. **Quality over quantity**: ~4000 curated examples outperformed what we'd expect from larger, noisier datasets
3. **Mixed tool/non-tool data**: kept the model grounded in both reasoning and tool-use contexts
4. **Response-only loss**: training only on assistant outputs preserved instruction-following

### Tradeoffs to Consider

- Small MMLU/TruthfulQA regressions suggest some factual knowledge displacement
- Style transfer always has costs β€” this approach minimizes but doesn't eliminate them
- Your mileage may vary depending on use case

---

## 🙏 Acknowledgments

This model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

---

## 📜 License

Apache 2.0. Use freely, build boldly.