---
base_model: Qwen/Qwen3.5-4B
license: apache-2.0
library_name: transformers
tags:
  - qwen3.5
  - merge
  - omnimerge
  - task-arithmetic
  - code
---

# Qwen3.5-4B-MicroCoder

A 4B-parameter code-leaning merge of Qwen3.5-4B that beats every individual
source on LCB-medium-55, while preserving full GSM8K parity with the strongest
reasoning fine-tune in the pool.

This card documents `Qwen3.5-4B-MicroCoder` (internally `v2i-jv-base-task-arith`),
the chosen frontier point of a 19-variant ablation that swept merge methods,
density, importance signals, AIME-protection masks, and skip-layer surgery.

Built with [**OmniMergeKit**](https://github.com/mann1x/omnimergekit) — the
open-source merge engine developed for this work.

## Headline numbers (Q6_K, greedy)

| Benchmark | base Qwen3.5-4B | jackrong-v2 (best source) | **MicroCoder** | Δ vs source |
|---|---:|---:|---:|---:|
| HumanEval (164q) | 60.37 | 60.37 | **57.32** | −3.05 |
| MBPP (500q) | 46.00 | 45.00 | **52.00** | **+7.00** |
| LiveCodeBench-30 (medium, post-2024-10-01) | 3.33 | 23.33 | **26.67** | **+3.34** |
| LiveCodeBench-55 (full medium pool) | — | 25.45 | **27.27** | **+1.82** |
| HumanEvalPlus (164q) | — | 54.88 | 50.00 | −4.88 |
| GSM8K (100q) | — | 83.00 | **83.00** | 0.00 |
| MMLU-Pro (200q) | — | 56.81 | 52.46 | −4.35 |
| AIME (30q) | — | 26.67 | 3.33 | −23.34 |

**Net:** +7pp MBPP, +3.3pp LCB-30, +1.8pp LCB-55, GSM8K parity. Trade-offs are
HumanEval (−3pp), HumanEvalPlus (−4.9pp), MMLU-Pro (−4.4pp), and AIME, which
drops to 3.33 (see "Why no AIME on the chosen variant?" below).
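
As a quick sanity check, the Δ column can be recomputed from the two score columns. The snippet below is only an illustrative sketch: the dictionary simply copies the `jackrong-v2` and MicroCoder numbers from the table above, and the variable names are made up for this example.

```python
# (source score, MicroCoder score), copied from the headline table.
scores = {
    "HumanEval": (60.37, 57.32),
    "MBPP": (45.00, 52.00),
    "LCB-30": (23.33, 26.67),
    "LCB-55": (25.45, 27.27),
    "HumanEvalPlus": (54.88, 50.00),
    "GSM8K": (83.00, 83.00),
    "MMLU-Pro": (56.81, 52.46),
    "AIME": (26.67, 3.33),
}

# Delta in percentage points, rounded to two decimals like the table.
deltas = {
    name: round(merged - source, 2)
    for name, (source, merged) in scores.items()
}

for name, delta in deltas.items():
    print(f"{name:14s} {delta:+.2f}")
```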

## Recipe

```bash
python omnimergekit.py \
    --base       Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled-v2 \
    --task-base  Qwen/Qwen3.5-4B \
    --source     coder_eval/continuum-code-forged \
    --source     coder_eval/jackrong-python \
    --method omnimerge_v2 --v2-features fisher,darex \
    --weights 0.55,0.45 --density 0.53 --darex-q 0.85 \
    --fisher continuum-forged.safetensors,jackrong-python.safetensors \
    --pr682-turbo \
    --seed 42 --device cuda
```

This is a **task-arithmetic** merge:

```
MicroCoder = jackrong-v2 + 0.55·DARE(continuum-code-forged − base) + 0.45·DARE(jackrong-python − base)
```

- **`jackrong-v2` (abbreviated `jv` below) is the merge base.** Its full output
  style and reasoning policy survive intact wherever the added deltas are
  zero. The two coding teachers contribute
  only their *delta from the official Qwen3.5-4B base*, not their absolute
  representations. This isolates "what the coder fine-tunes added on top of
  the public base" and grafts that onto the reasoning-distilled model.
- **DAREx-q 0.85** drops the bottom 85% of the `continuum-code-forged` (cf)
  and `jackrong-python` (jp) deltas by magnitude (per-tensor quantile) before
  random pruning, then rescales by 1/density. This kills low-magnitude noise
  while preserving the high-amplitude code-skill structure.
- **Fisher importance** from forward-pass gradient maps over the coder
  fine-tunes' own training-style data weights the EMR election so dominant
  per-element directions win when the two coding teachers disagree.
- **PR682-turbo** protects critical layers (norms, embeddings, lm_head,
  biases) at density 1.0 and falls back gracefully on shape mismatch.
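
The formula and the DAREx-q bullet above can be sketched on toy data. This is a minimal illustration under stated assumptions, not OmniMergeKit's implementation: `darex_prune` and `task_arith_merge` are hypothetical names, tensors are modelled as flat Python lists, the step order (quantile filter, random drop, 1/density rescale) is taken from the bullet text, and Fisher weighting and PR682 layer protection are left out entirely.

```python
import random


def darex_prune(delta, density, q, rng):
    """Magnitude filter, random drop, then 1/density rescale (flat list)."""
    if not delta:
        return []
    magnitudes = sorted(abs(d) for d in delta)
    # Per-tensor quantile threshold: entries below it are treated as noise.
    cut = magnitudes[min(int(q * len(magnitudes)), len(magnitudes) - 1)]
    out = []
    for d in delta:
        # Survivors of the magnitude filter are kept with probability `density`.
        keep = abs(d) >= cut and rng.random() < density
        out.append(d / density if keep else 0.0)
    return out


def task_arith_merge(merge_base, task_base, sources, weights,
                     density, q, seed=42):
    """merge_base + sum_i w_i * prune(source_i - task_base)."""
    rng = random.Random(seed)
    merged = list(merge_base)
    for src, w in zip(sources, weights):
        # Each teacher contributes only its delta from the task base.
        delta = [s - t for s, t in zip(src, task_base)]
        pruned = darex_prune(delta, density, q, rng)
        merged = [m + w * p for m, p in zip(merged, pruned)]
    return merged


# Toy stand-ins for one tensor of each model (values are made up).
task_base = [0.0, 1.0, 2.0, 3.0]
merge_base = [0.5, 1.5, 2.5, 3.5]
teacher_a = [1.0, 1.0, 2.0, 5.0]
teacher_b = [0.0, 1.0, 4.0, 3.0]

merged = task_arith_merge(merge_base, task_base, [teacher_a, teacher_b],
                          weights=[0.55, 0.45], density=0.53, q=0.85)
```

Where both teacher deltas are zero, the merge base passes through unchanged, which is the "passthrough at δ=0" property the recipe relies on.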

## Sources

| Model | Role | Weight |
|---|---|---:|
| [`Qwen/Qwen3.5-4B`](https://huggingface.co/Qwen/Qwen3.5-4B) | task base (delta reference) | — |
| [`Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled-v2`](https://huggingface.co/Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled-v2) | merge base | 1.0 (passthrough at δ=0) |
| `continuum-code-forged` | code teacher (delta) | 0.55 |
| `jackrong-python` | code teacher (delta) | 0.45 |

## Evaluation methodology

All evaluations were run with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
against llama.cpp `llama-server` serving the published Q6_K quantization,
through the raw `/v1/completions` endpoint with greedy decoding
(`temperature=0.0, top_p=1.0`). Generation limits were `max_gen_toks=2048` for
HE/MBPP and `max_gen_toks=8192` for LCB; the server ran with
`--parallel 2 --cache-type-k q8_0 --cache-type-v q8_0`.

LiveCodeBench: medium-difficulty functional problems with
`min_date=2024-10-01` (post-Qwen3.5 training cutoff to avoid contamination).
LCB-30 = first 30 problems of that pool, LCB-55 = full pool of 55.

## Experiment trail (why this recipe?)

19 variants were ablated over a multi-week sweep. Summary table for the
informative subset:

| variant | merge form | AIME | HE | MBPP | LCB-30 | verdict |
|---|---|---:|---:|---:|---:|---|
| base | Qwen3.5-4B | 0.00 | 60.4 | 46.0 | 3.33 | floor |
| jackrong-v2 | source | **26.67** | 60.4 | 45.0 | 23.3 | strong reasoning, weak LCB |
| v2g | 3-src DARE-TIES, fisher+darex | 0.00 | 56.1 | **54.0** | 26.7 | code champion (no AIME) |
| **v2i = MicroCoder** | task-arith on jv-base | 3.33 | **57.3** | 52.0 | **26.7** | **balanced — picked** |
| v2j | v2i + skip mlp.gate_proj 18-25, darex 0.92 | 10.00 | — | — | — | first AIME signal |
| v2k | v2j + wider skip 14-27 | 0.00 | — | — | — | over-blocked, collapsed |
| v2l | v2j + full MLP skip 18-25 | 3.33 | — | — | — | up/down_proj carry code skill |
| v2m | v2j + density 0.45 | 3.33 | — | — | — | lower density hits jv harder |
| v2n | v2j + darex 0.95 | **13.33** | 55.5 | 50.8 | 20.0 | reasoning ceiling |
| v2o | v2n + darex 0.97 | 13.33 | 56.7 | 51.0 | 16.7 | saturated |
| v2p | v2n + jv-AIME fisher mask α=1.0 | 13.33 | 55.5 | 50.8 | 20.0 | mask redundant |
| v2q | v2n + jv-AIME mask α=0.5 | 13.33 | 54.9 | 51.2 | 20.0 | mask redundant |
| v2r | mask α=1.0 alone, no skip | 3.33 | — | — | — | per-element scaling cannot replace layer skip |

### Key findings (apply to future merge work)

1. **Task-arithmetic with the strong source as merge_base wins over symmetric
   DARE-TIES** when one source is much stronger on the target axis (here:
   reasoning). v2g and v2i tie on LCB-55 (27.27%) but v2i wins HE/HE+/GSM8K
   and retains a small AIME signal that pure DARE-TIES kills.

2. **Skipping `mlp.gate_proj` in layers 18-25 is the load-bearing
   AIME-recovery knob** (+6.7pp, v2i 3.33 → v2j 10.00). The band comes from
   rescaling Qwen3.6's think-policy band (layers 27-52 of 64) to the 32-layer
   Qwen3.5, which gives 14-26, then conservatively narrowing it to 18-25.
   Wider bands (v2k, 14-27) collapse; full-MLP skip (v2l) destroys code skill.

3. **DAREx-q 0.92 → 0.95 adds 3.3pp AIME on top of the skip** by killing more
   low-magnitude cf/jp deltas in the protected reasoning band. **0.95 → 0.97
   saturates** (v2n=v2o on AIME).

4. **The jv-AIME fisher suppression mask is fully redundant with skip-layers**
   (v2n=v2p=v2q at AIME 13.33 *and* code metrics within noise). Per-element
   scaling cannot substitute for layer-level passthrough — jv's reasoning
   lives in the *coherent per-layer behavior* of mlp.gate_proj 18-25, not in
   the highest-importance individual cells. Mask alone (v2r) gives nothing.

5. **The 13.33% AIME ceiling is structural, not a tuning problem.** Three
   different mechanisms (high darex, higher darex, mask) all converge at
   the same number. Closing the remaining 13.34pp gap to jv source requires
   SFT distillation, not more merge tuning.
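
The layer-band arithmetic in finding 2 and the skip rule it feeds can be sketched as follows. This is an illustrative sketch only: `map_band` and `is_skipped` are hypothetical helpers, and the tensor-name pattern assumes the usual Hugging Face layout (`model.layers.<i>.mlp.gate_proj.weight`), which is not confirmed by this card.

```python
import math
import re


def map_band(lo, hi, src_layers, dst_layers):
    """Rescale an inclusive layer band from one model depth to another."""
    scale = dst_layers / src_layers
    # Round inward so the mapped band never exceeds the source band.
    return math.ceil(lo * scale), math.floor(hi * scale)


# Band 27-52 of a 64-layer model, rescaled to a 32-layer model.
band = map_band(27, 52, 64, 32)
print(band)  # (14, 26)

# Assumed tensor naming; adjust the pattern if the checkpoint differs.
GATE_PROJ = re.compile(r"\.layers\.(\d+)\.mlp\.gate_proj\.")


def is_skipped(tensor_name, lo=18, hi=25):
    """True if the tensor is a gate_proj weight inside the protected band."""
    match = GATE_PROJ.search(tensor_name)
    return bool(match) and lo <= int(match.group(1)) <= hi
```

A merge loop would leave any tensor for which `is_skipped` returns `True` at its merge-base value and apply the deltas everywhere else.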

### Why no AIME on the chosen variant?

MicroCoder (v2i) is the **code-leaning frontier point**. The skip-layer
recipe (v2n) recovers AIME to 13.33% but at a 6.7pp LCB-30 regression.
v2i preserves the better LCB; the trade is real and structural. A
reasoning-leaning sibling exists internally (v2n) but is not published —
LCB regression makes it strictly worse than `jackrong-v2` for math users
who already have access to the original.

## Files

- Full-precision safetensors weights (BF16). Use [`ManniX-ITA/Qwen3.5-4B-MicroCoder-GGUF`](https://huggingface.co/ManniX-ITA/Qwen3.5-4B-MicroCoder-GGUF) for the Q6_K quantization.

## Use

```bash
llama-server -m Qwen3.5-4B-MicroCoder-Q6_K.gguf \
    --port 8099 -c 32768 -t 12 -ngl 99 \
    --parallel 2 --cache-type-k q8_0 --cache-type-v q8_0
```

Greedy `temperature=0.0, top_p=1.0` recommended for code tasks.
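
A minimal client sketch for the server started above, using only the Python standard library. It assumes the server listens on `localhost:8099` as in the command and exposes the OpenAI-style `/v1/completions` endpoint; `build_completion_request` and `complete` are hypothetical helper names, and the response parsing assumes the OpenAI-compatible `choices[0].text` shape.

```python
import json
import urllib.request


def build_completion_request(prompt, max_tokens=2048,
                             base_url="http://localhost:8099"):
    """Build a greedy-decoding request for the raw completions endpoint."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy, as recommended for code tasks
        "top_p": 1.0,
    }
    return urllib.request.Request(
        base_url + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a live server)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read())
    return body["choices"][0]["text"]
```

With the server running, `complete("def fibonacci(n):\n")` would return the model's continuation of the prompt.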

## Citation

If you use this model or the OmniMergeKit recipes in your work:

```
@misc{mannix2026microcoder,
  title  = {Qwen3.5-4B-MicroCoder: a task-arithmetic merge for code},
  author = {Mannix, F.},
  year   = {2026},
  url    = {https://huggingface.co/ManniX-ITA/Qwen3.5-4B-MicroCoder},
  note   = {Built with OmniMergeKit, https://github.com/mann1x/omnimergekit}
}
```

## License

Apache 2.0, inherited from Qwen3.5-4B and the source fine-tunes.