---
base_model: Qwen/Qwen3.5-4B
license: apache-2.0
library_name: transformers
tags:
- qwen3.5
- merge
- omnimerge
- task-arithmetic
- code
---
# Qwen3.5-4B-MicroCoder
A 4B-parameter code-leaning merge of Qwen3.5-4B that beats every individual
source on LCB-medium-55, while preserving full GSM8K parity with the strongest
reasoning fine-tune in the pool.
This card documents `Qwen3.5-4B-MicroCoder` (internally `v2i-jv-base-task-arith`),
the chosen frontier point of a 19-variant ablation that swept merge methods,
density, importance signals, AIME-protection masks, and skip-layer surgery.
Built with [**OmniMergeKit**](https://github.com/mann1x/omnimergekit) — the
open-source merge engine developed for this work.
## Headline numbers (Q6_K, greedy)
| Benchmark | base Qwen3.5-4B | jackrong-v2 (best source) | **MicroCoder** | Δ vs source |
|---|---:|---:|---:|---:|
| HumanEval (164q) | 60.37 | 60.37 | **57.32** | −3.05 |
| MBPP (500q) | 46.00 | 45.00 | **52.00** | **+7.00** |
| LiveCodeBench-30 (medium, post-2024-10-01) | 3.33 | 23.33 | **26.67** | **+3.34** |
| LiveCodeBench-55 (full medium pool) | — | 25.45 | **27.27** | **+1.82** |
| HumanEvalPlus (164q) | — | 54.88 | 50.00 | −4.88 |
| GSM8K (100q) | — | 83.00 | **83.00** | 0.00 |
| MMLU-Pro (200q) | — | 56.81 | 52.46 | −4.35 |
| AIME (30q) | — | 26.67 | 3.33 | −23.34 |
**Net:** +7pp MBPP, +3.3pp LCB-30, +1.8pp LCB-55, GSM8K parity. Trade-offs are
HumanEval (−3pp), HumanEvalPlus (−4.9pp), MMLU-Pro (−4.4pp), and the AIME
math-reasoning floor (see "Why no AIME on the chosen variant?" below).
## Recipe
```bash
python omnimergekit.py \
--base Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled-v2 \
--task-base Qwen/Qwen3.5-4B \
--source coder_eval/continuum-code-forged \
--source coder_eval/jackrong-python \
--method omnimerge_v2 --v2-features fisher,darex \
--weights 0.55,0.45 --density 0.53 --darex-q 0.85 \
--fisher continuum-forged.safetensors,jackrong-python.safetensors \
--pr682-turbo \
--seed 42 --device cuda
```
This is a **task-arithmetic** merge:
```
MicroCoder = jackrong-v2 + 0.55·DARE(continuum-code-forged − base) + 0.45·DARE(jackrong-python − base)
```
- **`jackrong-v2` is the merge base** — its full output style and reasoning
policy survive intact at zero deltas. The two coding teachers contribute
only their *delta from the official Qwen3.5-4B base*, not their absolute
representations. This isolates "what the coder fine-tunes added on top of
the public base" and grafts that onto the reasoning-distilled model.
- **DAREx-q 0.85** drops the bottom 85% of the `continuum-code-forged` (cf) and
`jackrong-python` (jp) deltas by magnitude (per-tensor quantile) before random
pruning, then rescales by 1/density. This removes low-magnitude noise while
preserving the high-amplitude code-skill structure.
- **Fisher importance** from forward-pass gradient maps over the coder
fine-tunes' own training-style data weights the EMR election so dominant
per-element directions win when the two coding teachers disagree.
- **PR682-turbo** protects critical layers (norms, embeddings, lm_head,
biases) at density 1.0 and falls back gracefully on shape mismatch.
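The formula and the DAREx bullet above can be sketched on plain arrays. This is a minimal illustration under stated assumptions, not OmniMergeKit's implementation: the function names `darex_sparsify` and `task_arithmetic_merge` are hypothetical, state dicts are modeled as `dict[str, np.ndarray]`, and the Fisher-weighted election and PR682-turbo layer protection are left out entirely.

```python
import numpy as np


def darex_sparsify(delta, q, density, rng):
    """Sparsify one delta tensor as the bullets describe (illustrative only)."""
    magnitude = np.abs(delta)
    # Per-tensor quantile: entries below the q-th magnitude quantile are dropped.
    threshold = np.quantile(magnitude, q)
    keep_large = magnitude >= threshold
    # Random pruning: each entry survives with probability `density`.
    keep_random = rng.random(delta.shape) < density
    # Rescale the survivors by 1/density, as stated in the card.
    return delta * keep_large * keep_random / density


def task_arithmetic_merge(merge_base, task_base, teachers, weights,
                          q=0.85, density=0.53, seed=42):
    """merge_base + sum_i w_i * sparsify(teacher_i - task_base), per tensor."""
    rng = np.random.default_rng(seed)
    merged = {}
    for name, base_tensor in merge_base.items():
        total = np.zeros_like(base_tensor)
        for teacher, weight in zip(teachers, weights):
            # Deltas are taken against the task base, not the merge base.
            delta = teacher[name] - task_base[name]
            total += weight * darex_sparsify(delta, q, density, rng)
        merged[name] = base_tensor + total
    return merged
```

Because the teachers enter only through their deltas, a teacher identical to the task base contributes nothing and the merge base passes through unchanged, which is the "survive intact at zero deltas" property described above.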
## Sources
| Model | Role | Weight |
|---|---|---:|
| [`Qwen/Qwen3.5-4B`](https://huggingface.co/Qwen/Qwen3.5-4B) | task base (delta reference) | — |
| [`Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled-v2`](https://huggingface.co/Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled-v2) | merge base | 1.0 (passthrough at δ=0) |
| `continuum-code-forged` | code teacher (delta) | 0.55 |
| `jackrong-python` | code teacher (delta) | 0.45 |
## Evaluation methodology
All evaluations were run with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
against llama.cpp `llama-server` serving the published Q6_K quantization, using
the raw `/v1/completions` endpoint with greedy decoding (`temperature=0.0,
top_p=1.0`). Generation limits were `max_gen_toks=2048` for HE/MBPP and
`max_gen_toks=8192` for LCB. Server flags: `--parallel 2 --cache-type-k q8_0 --cache-type-v q8_0`.
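For readers reproducing the decoding settings outside the harness, here is a rough sketch of a greedy request to the raw completions endpoint. It assumes a local `llama-server` on port 8099 (the port used in the "Use" section), an OpenAI-style response with `choices[0].text`, and that the harness's `max_gen_toks` corresponds to the API's `max_tokens` field. The helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_payload(prompt, max_tokens):
    """Greedy-decoding request body mirroring the settings listed above."""
    return {
        "prompt": prompt,
        "temperature": 0.0,
        "top_p": 1.0,
        "max_tokens": max_tokens,  # assumed to map to the harness's max_gen_toks
    }


def complete(prompt, max_tokens=2048, base_url="http://localhost:8099"):
    """POST to /v1/completions and return the generated text."""
    body = json.dumps(build_completion_payload(prompt, max_tokens)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # Assumes the OpenAI-compatible response shape.
        return json.loads(response.read())["choices"][0]["text"]
```

With the server running, `complete("def fib(n):", max_tokens=2048)` would return the raw continuation with no chat template applied.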
LiveCodeBench: medium-difficulty functional problems with
`min_date=2024-10-01`, chosen to fall after the Qwen3.5 training cutoff and so avoid contamination.
LCB-30 is the first 30 problems of that pool; LCB-55 is the full pool of 55.
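The pool construction can be pictured as a filter followed by a prefix slice. The sketch below is illustrative only: the record fields (`difficulty`, `is_functional`, `contest_date`) are hypothetical stand-ins rather than the LiveCodeBench schema, and the input order is kept as-is because the card does not state how the pool is ordered.

```python
from datetime import date


def select_lcb_pool(problems, min_date=date(2024, 10, 1)):
    """Keep medium-difficulty functional problems dated on or after min_date."""
    return [
        p for p in problems
        if p["difficulty"] == "medium"
        and p["is_functional"]
        and p["contest_date"] >= min_date
    ]


def first_n(pool, n=30):
    """LCB-30 style subset: the first n problems of the filtered pool."""
    return pool[:n]
```

Applied to the real data, `select_lcb_pool` would correspond to LCB-55 and `first_n(pool, 30)` to LCB-30.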
## Experiment trail (why this recipe?)
Nineteen variants were ablated over a multi-week sweep. The table below
summarizes the informative subset (jv = `jackrong-v2`):
| variant | merge form | AIME | HE | MBPP | LCB-30 | verdict |
|---|---|---:|---:|---:|---:|---|
| base | Qwen3.5-4B | 0.00 | 60.4 | 46.0 | 3.33 | floor |
| jackrong-v2 | source | **26.67** | 60.4 | 45.0 | 23.3 | strong reasoning, weak LCB |
| v2g | 3-src DARE-TIES, fisher+darex | 0.00 | 56.1 | **54.0** | 26.7 | code champion (no AIME) |
| **v2i = MicroCoder** | task-arith on jv-base | 3.33 | **57.3** | 52.0 | **26.7** | **balanced — picked** |
| v2j | v2i + skip mlp.gate_proj 18-25, darex 0.92 | 10.00 | — | — | — | first AIME signal |
| v2k | v2j + wider skip 14-27 | 0.00 | — | — | — | over-blocked, collapsed |
| v2l | v2j + full MLP skip 18-25 | 3.33 | — | — | — | up/down_proj carry code skill |
| v2m | v2j + density 0.45 | 3.33 | — | — | — | lower density hits jv harder |
| v2n | v2j + darex 0.95 | **13.33** | 55.5 | 50.8 | 20.0 | reasoning ceiling |
| v2o | v2n + darex 0.97 | 13.33 | 56.7 | 51.0 | 16.7 | saturated |
| v2p | v2n + jv-AIME fisher mask α=1.0 | 13.33 | 55.5 | 50.8 | 20.0 | mask redundant |
| v2q | v2n + jv-AIME mask α=0.5 | 13.33 | 54.9 | 51.2 | 20.0 | mask redundant |
| v2r | mask α=1.0 alone, no skip | 3.33 | — | — | — | per-element scaling cannot replace layer skip |
### Key findings (apply to future merge work)
1. **Task-arithmetic with the strong source as merge_base wins over symmetric
DARE-TIES** when one source is much stronger on the target axis (here:
reasoning). v2g and v2i tie on LCB-55 (27.27%) but v2i wins HE/HE+/GSM8K
and retains a small AIME signal that pure DARE-TIES kills.
2. **Skip mlp.gate_proj layers 18-25 is the load-bearing AIME-recovery knob**
(+6.7pp). The band comes from scaling Qwen3.6's think-policy band (layers
27-52 of 64) to the 32-layer Qwen3.5, which gives layers 14-26, and then
conservatively narrowing it to 18-25. Wider bands (v2k, 14-27) collapse;
full-MLP skip (v2l) destroys code skill.
3. **DAREx-q 0.92 → 0.95 adds 3.3pp AIME on top of the skip** by killing more
low-magnitude cf/jp deltas in the protected reasoning band. **0.95 → 0.97
saturates** (v2n=v2o on AIME).
4. **The jv-AIME fisher suppression mask is fully redundant with skip-layers**
(v2n=v2p=v2q at AIME 13.33 *and* code metrics within noise). Per-element
scaling cannot substitute for layer-level passthrough — jv's reasoning
lives in the *coherent per-layer behavior* of mlp.gate_proj 18-25, not in
the highest-importance individual cells. Mask alone (v2r) gives nothing.
5. **The 13.33% AIME ceiling is structural, not a tuning problem.** Three
different mechanisms (DAREx-q 0.95 in v2n, DAREx-q 0.97 in v2o, and the
fisher mask in v2p/v2q) all converge at the same number. Closing the
remaining 13.34pp gap to the jv source requires SFT distillation, not more
merge tuning.
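The layer-band arithmetic in finding 2 and the skip rule itself can be sketched as follows. The rounding convention (ceil on the lower bound, floor on the upper) is an assumption that happens to reproduce the 14-26 band quoted above, and the tensor-name pattern `model.layers.<i>.mlp.gate_proj.` follows the usual Hugging Face Qwen layout, which is also assumed here. Both helpers are hypothetical and not part of OmniMergeKit.

```python
import math
import re


def scale_band(lo, hi, src_layers, dst_layers):
    """Map an inclusive layer band from one model depth to another."""
    ratio = dst_layers / src_layers
    # Assumed rounding: shrink inward so the band never overshoots.
    return math.ceil(lo * ratio), math.floor(hi * ratio)


# Assumed tensor naming: "model.layers.<index>.mlp.gate_proj.weight".
_GATE_PROJ = re.compile(r"\.layers\.(\d+)\.mlp\.gate_proj\.")


def is_skipped(tensor_name, lo=18, hi=25):
    """True if the tensor is an mlp.gate_proj weight inside the protected band."""
    match = _GATE_PROJ.search(tensor_name)
    return match is not None and lo <= int(match.group(1)) <= hi
```

A skipped tensor would receive no teacher delta, so the merge base's weights pass through unchanged for that layer. The `up_proj` and `down_proj` tensors in the same band are still merged, consistent with the v2l result that they carry code skill.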
### Why no AIME on the chosen variant?
MicroCoder (v2i) is the **code-leaning frontier point**. The skip-layer
recipe (v2n) recovers AIME to 13.33% but at a 6.7pp LCB-30 regression.
v2i preserves the better LCB; the trade is real and structural. A
reasoning-leaning sibling exists internally (v2n) but is not published —
LCB regression makes it strictly worse than `jackrong-v2` for math users
who already have access to the original.
## Files
- Full-precision safetensors weights (BF16). Use [`ManniX-ITA/Qwen3.5-4B-MicroCoder-GGUF`](https://huggingface.co/ManniX-ITA/Qwen3.5-4B-MicroCoder-GGUF) for the Q6_K quantization.
## Use
```bash
llama-server -m Qwen3.5-4B-MicroCoder-Q6_K.gguf \
--port 8099 -c 32768 -t 12 -ngl 99 \
--parallel 2 --cache-type-k q8_0 --cache-type-v q8_0
```
Greedy decoding (`temperature=0.0, top_p=1.0`) is recommended for code tasks.
## Citation
If you use this model or the OmniMergeKit recipes in your work:
```
@misc{mannix2026microcoder,
title = {Qwen3.5-4B-MicroCoder: a task-arithmetic merge for code},
author = {Mannix, F.},
year = {2026},
url = {https://huggingface.co/ManniX-ITA/Qwen3.5-4B-MicroCoder},
note = {Built with OmniMergeKit, https://github.com/mann1x/omnimergekit}
}
```
## License
Apache 2.0, inherited from Qwen3.5-4B and the source fine-tunes.