---
base_model: Qwen/Qwen3.5-4B
license: apache-2.0
library_name: transformers
tags:
  - qwen3.5
  - merge
  - omnimerge
  - task-arithmetic
  - code
---

# Qwen3.5-4B-MicroCoder

A 4B-parameter code-leaning merge of Qwen3.5-4B that beats every individual
source on LiveCodeBench medium (LCB-55) while matching the GSM8K score of the
strongest reasoning fine-tune in the pool.

This card documents `Qwen3.5-4B-MicroCoder` (internally `v2i-jv-base-task-arith`),
the chosen frontier point of a 19-variant ablation that swept merge methods,
density, importance signals, AIME-protection masks, and skip-layer surgery.

Built with [**OmniMergeKit**](https://github.com/mann1x/omnimergekit) — the
open-source merge engine developed for this work.

## Headline numbers (Q6_K, greedy)

| Benchmark | base Qwen3.5-4B | jackrong-v2 (best source) | **MicroCoder** | Δ vs source |
|---|---:|---:|---:|---:|
| HumanEval (164q) | 60.37 | 60.37 | **57.32** | −3.05 |
| MBPP (500q) | 46.00 | 45.00 | **52.00** | **+7.00** |
| LiveCodeBench-30 (medium, post-2024-10-01) | 3.33 | 23.33 | **26.67** | **+3.34** |
| LiveCodeBench-55 (full medium pool) | — | 25.45 | **27.27** | **+1.82** |
| HumanEvalPlus (164q) | — | 54.88 | 50.00 | −4.88 |
| GSM8K (100q) | — | 83.00 | **83.00** | 0.00 |
| MMLU-Pro (200q) | — | 56.81 | 52.46 | −4.35 |
| AIME (30q) | — | 26.67 | 3.33 | −23.34 |

**Net:** +7pp MBPP, +3.3pp LCB-30, +1.8pp LCB-55, GSM8K parity. Trade-offs are
HumanEval (−3pp), HumanEvalPlus (−4.9pp), MMLU-Pro (−4.4pp), and the AIME
math-reasoning floor (see "Why no AIME on the chosen variant?" below).

## Recipe

```bash
python omnimergekit.py \
    --base       Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled-v2 \
    --task-base  Qwen/Qwen3.5-4B \
    --source     coder_eval/continuum-code-forged \
    --source     coder_eval/jackrong-python \
    --method omnimerge_v2 --v2-features fisher,darex \
    --weights 0.55,0.45 --density 0.53 --darex-q 0.85 \
    --fisher continuum-forged.safetensors,jackrong-python.safetensors \
    --pr682-turbo \
    --seed 42 --device cuda
```

This is a **task-arithmetic** merge:

```
MicroCoder = jackrong-v2 + 0.55·DARE(continuum-code-forged − Qwen3.5-4B) + 0.45·DARE(jackrong-python − Qwen3.5-4B)
```

- **`jackrong-v2` (jv) is the merge base.** Wherever the pruned deltas are
  zero, its weights pass through unchanged, so its output style and reasoning
  policy carry over. The two coding teachers contribute only their *delta
  from the official Qwen3.5-4B base*, not their absolute representations.
  This isolates "what the coder fine-tunes added on top of the public base"
  and grafts that onto the reasoning-distilled model.
- **DAREx-q 0.85** drops the bottom 85% of the continuum-code-forged (cf) and
  jackrong-python (jp) deltas by magnitude (per-tensor quantile) before
  random pruning, then rescales the survivors by 1/density. This removes
  low-magnitude noise while preserving the high-amplitude code-skill
  structure.
- **Fisher importance**, computed from forward-pass gradient maps over data
  in the style of each coder fine-tune's own training set, weights the EMR
  election so that the dominant per-element direction wins when the two
  coding teachers disagree.
- **PR682-turbo** protects critical layers (norms, embeddings, lm_head,
  biases) at density 1.0 and falls back gracefully on shape mismatch.

## Sources

| Model | Role | Weight |
|---|---|---:|
| [`Qwen/Qwen3.5-4B`](https://huggingface.co/Qwen/Qwen3.5-4B) | task base (delta reference) | — |
| [`Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled-v2`](https://huggingface.co/Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled-v2) | merge base | 1.0 (passthrough at δ=0) |
| `continuum-code-forged` | code teacher (delta) | 0.55 |
| `jackrong-python` | code teacher (delta) | 0.45 |

## Evaluation methodology

All evaluations ran through [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
against llama.cpp `llama-server` serving the published Q6_K quantization. The
harness used the raw `/v1/completions` endpoint with greedy decoding
(`temperature=0.0, top_p=1.0`), `max_gen_toks=2048` for HE/MBPP and
`max_gen_toks=8192` for LCB. The server ran with
`--parallel 2 --cache-type-k q8_0 --cache-type-v q8_0`.

LiveCodeBench: medium-difficulty functional problems with
`min_date=2024-10-01` (post-Qwen3.5 training cutoff to avoid contamination).
LCB-30 = first 30 problems of that pool, LCB-55 = full pool of 55.
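
Because these pools are small, it can help to read the percentages as problem counts. The sketch below assumes each score is the fraction of problems solved in a single greedy pass; `solved_count` and `step_pp` are hypothetical helper names, and the inputs are copied from the headline table.

```python
def solved_count(pct: float, n_problems: int) -> int:
    """Recover the whole number of solved problems from a reported percentage."""
    return round(pct * n_problems / 100)


def step_pp(n_problems: int) -> float:
    """Score change, in percentage points, caused by one problem flipping."""
    return 100 / n_problems


# (label, reported score in %, pool size), taken from the headline table.
rows = [
    ("LCB-30 jackrong-v2", 23.33, 30),
    ("LCB-30 MicroCoder", 26.67, 30),
    ("LCB-55 jackrong-v2", 25.45, 55),
    ("LCB-55 MicroCoder", 27.27, 55),
    ("AIME jackrong-v2", 26.67, 30),
    ("AIME MicroCoder", 3.33, 30),
]

for label, pct, n in rows:
    print(f"{label}: {solved_count(pct, n)}/{n} solved, one problem = {step_pp(n):.2f}pp")
```

Under that assumption, one problem is worth about 3.33pp on the 30-question pools and about 1.82pp on LCB-55, which is the granularity to keep in mind when comparing variants.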

## Experiment trail (why this recipe?)

19 variants were ablated over a multi-week sweep. Summary table for the
informative subset:

| variant | merge form | AIME | HE | MBPP | LCB-30 | verdict |
|---|---|---:|---:|---:|---:|---|
| base | Qwen3.5-4B | 0.00 | 60.4 | 46.0 | 3.33 | floor |
| jackrong-v2 | source | **26.67** | 60.4 | 45.0 | 23.3 | strong reasoning, weak LCB |
| v2g | 3-src DARE-TIES, fisher+darex | 0.00 | 56.1 | **54.0** | 26.7 | code champion (no AIME) |
| **v2i = MicroCoder** | task-arith on jv-base | 3.33 | **57.3** | 52.0 | **26.7** | **balanced — picked** |
| v2j | v2i + skip mlp.gate_proj 18-25, darex 0.92 | 10.00 | — | — | — | first AIME signal |
| v2k | v2j + wider skip 14-27 | 0.00 | — | — | — | over-blocked, collapsed |
| v2l | v2j + full MLP skip 18-25 | 3.33 | — | — | — | up/down_proj carry code skill |
| v2m | v2j + density 0.45 | 3.33 | — | — | — | lower density hits jv harder |
| v2n | v2j + darex 0.95 | **13.33** | 55.5 | 50.8 | 20.0 | reasoning ceiling |
| v2o | v2n + darex 0.97 | 13.33 | 56.7 | 51.0 | 16.7 | saturated |
| v2p | v2n + jv-AIME fisher mask α=1.0 | 13.33 | 55.5 | 50.8 | 20.0 | mask redundant |
| v2q | v2n + jv-AIME mask α=0.5 | 13.33 | 54.9 | 51.2 | 20.0 | mask redundant |
| v2r | mask α=1.0 alone, no skip | 3.33 | — | — | — | per-element scaling cannot replace layer skip |

### Key findings (apply to future merge work)

1. **Task-arithmetic with the strong source as merge_base wins over symmetric
   DARE-TIES** when one source is much stronger on the target axis (here:
   reasoning). v2g and v2i tie on LCB-55 (27.27%) but v2i wins HE/HE+/GSM8K
   and retains a small AIME signal that pure DARE-TIES kills.

2. **Skipping mlp.gate_proj in layers 18-25 is the load-bearing AIME-recovery
   knob** (+6.7pp, v2i → v2j). The band comes from scaling Qwen3.6's
   think-policy band (layers 27-52 of 64) to the 32-layer Qwen3.5, which
   gives 14-26, then narrowing conservatively to 18-25. Wider bands (v2k,
   14-27) collapse; a full-MLP skip (v2l) destroys code skill.

3. **DAREx-q 0.92 → 0.95 adds 3.3pp AIME on top of the skip** by killing more
   low-magnitude cf/jp deltas in the protected reasoning band. **0.95 → 0.97
   saturates** (v2n=v2o on AIME).

4. **The jv-AIME fisher suppression mask is fully redundant with skip-layers**
   (v2n=v2p=v2q at AIME 13.33 *and* code metrics within noise). Per-element
   scaling cannot substitute for layer-level passthrough — jv's reasoning
   lives in the *coherent per-layer behavior* of mlp.gate_proj 18-25, not in
   the highest-importance individual cells. Mask alone (v2r) gives nothing.

5. **The 13.33% AIME ceiling is structural, not a tuning problem.** Three
   different mechanisms (high darex, higher darex, mask) all converge at
   the same number. Closing the remaining 13.34pp gap to jv source requires
   SFT distillation, not more merge tuning.
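
The band arithmetic in finding 2 and the layer-level passthrough it implies can be sketched as follows. This is an illustration under stated assumptions, not OmniMergeKit code: `scale_band` and `is_skipped` are hypothetical names, rounding the scaled band inward is an assumption that happens to reproduce the 14-26 figure, and the tensor naming (`...layers.<N>.mlp.gate_proj.weight`) follows the usual Hugging Face convention and may differ for a given checkpoint.

```python
import math
import re


def scale_band(lo: int, hi: int, src_layers: int, dst_layers: int) -> tuple[int, int]:
    """Map a layer band between model depths, rounding inward (assumed convention)."""
    ratio = dst_layers / src_layers
    return math.ceil(lo * ratio), math.floor(hi * ratio)


# Assumed tensor-name pattern; captures the layer index of any gate_proj tensor.
GATE_PROJ = re.compile(r"layers\.(\d+)\.mlp\.gate_proj\.")


def is_skipped(tensor_name: str, band: tuple[int, int] = (18, 25)) -> bool:
    """True if the tensor should pass through from the merge base untouched."""
    match = GATE_PROJ.search(tensor_name)
    return bool(match) and band[0] <= int(match.group(1)) <= band[1]


scaled = scale_band(27, 52, src_layers=64, dst_layers=32)
print("scaled band:", scaled)

for name in (
    "model.layers.18.mlp.gate_proj.weight",
    "model.layers.26.mlp.gate_proj.weight",
    "model.layers.20.mlp.up_proj.weight",
):
    print(name, "-> skip" if is_skipped(name) else "-> merge")
```

Only `gate_proj` tensors inside the band are held back; `up_proj` and `down_proj` still receive the coder deltas, in line with the v2l result.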

### Why no AIME on the chosen variant?

MicroCoder (v2i) is the **code-leaning frontier point**. The skip-layer
recipe (v2n) recovers AIME to 13.33% but at a 6.7pp LCB-30 regression.
v2i preserves the better LCB; the trade is real and structural. A
reasoning-leaning sibling (v2n) exists internally but is not published: it
trails `jackrong-v2` on both AIME (13.33 vs 26.67) and LCB-30 (20.0 vs 23.3),
so math users are better served by the original, which is already available.

## Files

- Full-precision safetensors weights (BF16). Use [`ManniX-ITA/Qwen3.5-4B-MicroCoder-GGUF`](https://huggingface.co/ManniX-ITA/Qwen3.5-4B-MicroCoder-GGUF) for the Q6_K quantization.

## Use

```bash
llama-server -m Qwen3.5-4B-MicroCoder-Q6_K.gguf \
    --port 8099 -c 32768 -t 12 -ngl 99 \
    --parallel 2 --cache-type-k q8_0 --cache-type-v q8_0
```

Greedy decoding (`temperature=0.0, top_p=1.0`) is recommended for code tasks.
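
Once the server is running, a greedy request can be sent to the same `/v1/completions` endpoint used for evaluation. The sketch below uses only the Python standard library and assumes the host and port from the command above; `build_completion_request` and `complete` are hypothetical helper names, and the response is assumed to follow the OpenAI-style `choices[0].text` layout.

```python
import json
import urllib.request


def build_completion_request(prompt, max_tokens=2048, base_url="http://localhost:8099"):
    """Build a greedy-decoding request for the raw completions endpoint."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding
        "top_p": 1.0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server up, `complete("def fibonacci(n):\n")` returns the raw continuation; no chat template is applied on this endpoint.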

## Citation

If you use this model or the OmniMergeKit recipes in your work:

```bibtex
@misc{mannix2026microcoder,
  title  = {Qwen3.5-4B-MicroCoder: a task-arithmetic merge for code},
  author = {Mannix, F.},
  year   = {2026},
  url    = {https://huggingface.co/ManniX-ITA/Qwen3.5-4B-MicroCoder},
  note   = {Built with OmniMergeKit, https://github.com/mann1x/omnimergekit}
}
```

## License

Apache 2.0, inherited from Qwen3.5-4B and the source fine-tunes.