---
base_model: Qwen/Qwen3.5-4B
license: apache-2.0
library_name: transformers
tags:
- qwen3.5
- merge
- omnimerge
- task-arithmetic
- code
---

# Qwen3.5-4B-MicroCoder

A 4B-parameter, code-leaning merge of Qwen3.5-4B that beats every individual source on LCB-medium-55 while matching the GSM8K score of the strongest reasoning fine-tune in the pool.

This card documents `Qwen3.5-4B-MicroCoder` (internally `v2i-jv-base-task-arith`), the chosen frontier point of a 19-variant ablation that swept merge methods, density, importance signals, AIME-protection masks, and skip-layer surgery.

Built with [**OmniMergeKit**](https://github.com/mann1x/omnimergekit), the open-source merge engine developed for this work.

## Headline numbers (Q6_K, greedy)

| Benchmark | base Qwen3.5-4B | jackrong-v2 (best source) | **MicroCoder** | Δ vs source |
|---|---:|---:|---:|---:|
| HumanEval (164q) | 60.37 | 60.37 | **57.32** | −3.05 |
| MBPP (500q) | 46.00 | 45.00 | **52.00** | **+7.00** |
| LiveCodeBench-30 (medium, post-2024-10-01) | 3.33 | 23.33 | **26.67** | **+3.34** |
| LiveCodeBench-55 (full medium pool) | n/a | 25.45 | **27.27** | **+1.82** |
| HumanEvalPlus (164q) | n/a | 54.88 | 50.00 | −4.88 |
| GSM8K (100q) | n/a | 83.00 | **83.00** | 0.00 |
| MMLU-Pro (200q) | n/a | 56.81 | 52.46 | −4.35 |
| AIME (30q) | n/a | 26.67 | 3.33 | −23.34 |

**Net:** +7pp MBPP, +3.3pp LCB-30, +1.8pp LCB-55, GSM8K parity. The trade-offs are HumanEval (−3pp), HumanEvalPlus (−4.9pp), MMLU-Pro (−4.4pp), and a drop to the AIME math-reasoning floor (−23.3pp; see "Why no AIME on the chosen variant?" below).
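The "Δ vs source" column and the Net line can be re-derived directly from the two score columns. The short sketch below does that arithmetic; the names `SCORES` and `delta_vs_source` are illustrative and not part of any tooling mentioned in this card.

```python
# (jackrong-v2 score, MicroCoder score), copied from the headline table above.
SCORES = {
    "HumanEval": (60.37, 57.32),
    "MBPP": (45.00, 52.00),
    "LiveCodeBench-30": (23.33, 26.67),
    "LiveCodeBench-55": (25.45, 27.27),
    "HumanEvalPlus": (54.88, 50.00),
    "GSM8K": (83.00, 83.00),
    "MMLU-Pro": (56.81, 52.46),
    "AIME": (26.67, 3.33),
}


def delta_vs_source(source_score, merged_score):
    """Return the change in percentage points, rounded to two decimals."""
    return round(merged_score - source_score, 2)


# Per-benchmark deltas, positive when MicroCoder beats the source.
DELTAS = {name: delta_vs_source(src, merged) for name, (src, merged) in SCORES.items()}

for name, delta in DELTAS.items():
    print(f"{name:18s} {delta:+.2f}")
```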
## Recipe

```bash
python omnimergekit.py \
  --base Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled-v2 \
  --task-base Qwen/Qwen3.5-4B \
  --source coder_eval/continuum-code-forged \
  --source coder_eval/jackrong-python \
  --method omnimerge_v2 --v2-features fisher,darex \
  --weights 0.55,0.45 --density 0.53 --darex-q 0.85 \
  --fisher continuum-forged.safetensors,jackrong-python.safetensors \
  --pr682-turbo \
  --seed 42 --device cuda
```

This is a **task-arithmetic** merge, where `base` is the official Qwen3.5-4B:

```
MicroCoder = jackrong-v2
           + 0.55·DARE(continuum-code-forged − base)
           + 0.45·DARE(jackrong-python − base)
```

- **`jackrong-v2` is the merge base.** Its weights pass through unchanged wherever the coder deltas are zero, which preserves its output style and reasoning policy. The two coding teachers contribute only their *delta from the official Qwen3.5-4B base*, not their absolute representations. This isolates "what the coder fine-tunes added on top of the public base" and grafts that onto the reasoning-distilled model.
- **DAREx-q 0.85** drops the bottom 85% of cf/jp (continuum-code-forged / jackrong-python) deltas by magnitude, using a per-tensor quantile, before random pruning, then rescales by 1/density. This removes low-magnitude noise while preserving the high-amplitude code-skill structure.
- **Fisher importance**, computed from forward-pass gradient maps over data in the style of each coder fine-tune's own training set, weights the EMR election so that the dominant per-element direction wins when the two coding teachers disagree.
- **PR682-turbo** protects critical layers (norms, embeddings, lm_head, biases) at density 1.0 and falls back gracefully on shape mismatch.
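As a rough illustration of the formula above, the following pure-Python sketch applies quantile filtering, random pruning, and rescaling to each coder delta, then adds the weighted result to the merge base. It is not OmniMergeKit's implementation: the function names (`darex_prune`, `task_arithmetic_merge`), the flat-list tensor representation, and the exact threshold rule are assumptions made for the example, and the Fisher-weighted election and PR682-turbo layer protection are left out entirely.

```python
import random


def darex_prune(delta, q, density, rng):
    """Quantile-filter, randomly drop, and rescale one flat delta tensor.

    Assumed rule: entries whose magnitude lies below the q-quantile are zeroed,
    each survivor is kept with probability `density` and scaled by 1/density.
    """
    if not delta:
        return []
    mags = sorted(abs(d) for d in delta)
    threshold = mags[min(int(q * len(mags)), len(mags) - 1)]
    pruned = []
    for d in delta:
        if abs(d) < threshold:
            pruned.append(0.0)          # low-magnitude entry, dropped by the quantile filter
        elif rng.random() < density:
            pruned.append(d / density)  # survivor, rescaled to keep the expected sum
        else:
            pruned.append(0.0)          # dropped by random pruning
    return pruned


def task_arithmetic_merge(merge_base, task_base, sources, weights, q, density, seed=42):
    """Add weighted, pruned (source - task_base) deltas onto merge_base."""
    rng = random.Random(seed)
    merged = list(merge_base)
    for source, weight in zip(sources, weights):
        delta = [s - b for s, b in zip(source, task_base)]
        pruned = darex_prune(delta, q, density, rng)
        merged = [m + weight * p for m, p in zip(merged, pruned)]
    return merged


if __name__ == "__main__":
    # Toy 4-element "tensors"; the values are made up for illustration only.
    task_base = [0.0, 0.0, 0.0, 0.0]
    merge_base = [0.1, 0.2, 0.3, 0.4]
    coder_a = [0.01, -0.5, 0.02, 0.9]
    coder_b = [0.6, 0.03, -0.7, 0.01]
    print(task_arithmetic_merge(merge_base, task_base, [coder_a, coder_b],
                                [0.55, 0.45], q=0.85, density=0.53))
```

With `q=0.0` and `density=1.0` the sketch reduces to plain task arithmetic, which makes a convenient sanity check. Positions where every pruned delta is zero return the merge-base value unchanged.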
## Sources

| Model | Role | Weight |
|---|---|---:|
| [`Qwen/Qwen3.5-4B`](https://huggingface.co/Qwen/Qwen3.5-4B) | task base (delta reference) | n/a |
| [`Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled-v2`](https://huggingface.co/Jackrong/Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled-v2) | merge base | 1.0 (passthrough at δ=0) |
| `continuum-code-forged` | code teacher (delta) | 0.55 |
| `jackrong-python` | code teacher (delta) | 0.45 |

## Evaluation methodology

All evaluations use [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) against llama.cpp `llama-server` serving the published Q6_K quantization through the raw `/v1/completions` endpoint. Decoding is greedy (`temperature=0.0, top_p=1.0`), with `max_gen_toks=2048` for HE/MBPP and `max_gen_toks=8192` for LCB. Server flags: `--parallel 2 --cache-type-k q8_0 --cache-type-v q8_0`.

LiveCodeBench: medium-difficulty functional problems with `min_date=2024-10-01` (after the Qwen3.5 training cutoff, to avoid contamination). LCB-30 is the first 30 problems of that pool; LCB-55 is the full pool of 55.

## Experiment trail (why this recipe?)

19 variants were ablated over a multi-week sweep.
Summary table for the informative subset:

| variant | merge form | AIME | HE | MBPP | LCB-30 | verdict |
|---|---|---:|---:|---:|---:|---|
| base | Qwen3.5-4B | 0.00 | 60.4 | 46.0 | 3.33 | floor |
| jackrong-v2 | source | **26.67** | 60.4 | 45.0 | 23.3 | strong reasoning, weak LCB |
| v2g | 3-src DARE-TIES, fisher+darex | 0.00 | 56.1 | **54.0** | 26.7 | code champion (no AIME) |
| **v2i = MicroCoder** | task-arith on jv-base | 3.33 | **57.3** | 52.0 | **26.7** | **balanced (picked)** |
| v2j | v2i + skip mlp.gate_proj 18-25, darex 0.92 | 10.00 | n/a | n/a | n/a | first AIME signal |
| v2k | v2j + wider skip 14-27 | 0.00 | n/a | n/a | n/a | over-blocked, collapsed |
| v2l | v2j + full MLP skip 18-25 | 3.33 | n/a | n/a | n/a | up/down_proj carry code skill |
| v2m | v2j + density 0.45 | 3.33 | n/a | n/a | n/a | lower density hits jv harder |
| v2n | v2j + darex 0.95 | **13.33** | 55.5 | 50.8 | 20.0 | reasoning ceiling |
| v2o | v2n + darex 0.97 | 13.33 | 56.7 | 51.0 | 16.7 | saturated |
| v2p | v2n + jv-AIME fisher mask α=1.0 | 13.33 | 55.5 | 50.8 | 20.0 | mask redundant |
| v2q | v2n + jv-AIME mask α=0.5 | 13.33 | 54.9 | 51.2 | 20.0 | mask redundant |
| v2r | mask α=1.0 alone, no skip | 3.33 | n/a | n/a | n/a | per-element scaling cannot replace layer skip |

### Key findings (apply to future merge work)

1. **Task arithmetic with the strong source as merge_base wins over symmetric DARE-TIES** when one source is much stronger on the target axis (here: reasoning). v2g and v2i tie on LCB-55 (27.27%), but v2i wins HE/HE+/GSM8K and retains a small AIME signal that pure DARE-TIES kills.
2. **Skipping `mlp.gate_proj` in layers 18-25 is the load-bearing AIME-recovery knob** (+6.7pp). The band is mapped from Qwen3.6's think-policy band (layers 27-52 of 64): scaled to the 32-layer Qwen3.5 this gives 14-26, conservatively narrowed to 18-25. Wider bands (v2k, 14-27) collapse; a full-MLP skip (v2l) destroys code skill.
3. **Raising DAREx-q from 0.92 to 0.95 adds 3.3pp AIME on top of the skip** by killing more low-magnitude cf/jp deltas in the protected reasoning band. **Going from 0.95 to 0.97 saturates** (v2n = v2o on AIME).
4. **The jv-AIME fisher suppression mask is fully redundant with skip-layers** (v2n = v2p = v2q at AIME 13.33, *and* code metrics within noise). Per-element scaling cannot substitute for layer-level passthrough: jv's reasoning lives in the *coherent per-layer behavior* of `mlp.gate_proj` 18-25, not in the highest-importance individual cells. The mask alone (v2r) gives nothing.
5. **The 13.33% AIME ceiling is structural, not a tuning problem.** Three different mechanisms (high darex, higher darex, mask) all converge on the same number. Closing the remaining 13.34pp gap to the jv source requires SFT distillation, not more merge tuning.

### Why no AIME on the chosen variant?

MicroCoder (v2i) is the **code-leaning frontier point**. The skip-layer recipe (v2n) recovers AIME to 13.33%, but at the cost of a 6.7pp LCB-30 regression. v2i preserves the better LCB score; the trade is real and structural.

A reasoning-leaning sibling (v2n) exists internally but is not published: with its LCB regression it is strictly worse than `jackrong-v2` for math users, who already have access to the original.

## Files

- Full-precision safetensors weights (BF16). Use [`ManniX-ITA/Qwen3.5-4B-MicroCoder-GGUF`](https://huggingface.co/ManniX-ITA/Qwen3.5-4B-MicroCoder-GGUF) for the Q6_K quantization.

## Use

```bash
llama-server -m Qwen3.5-4B-MicroCoder-Q6_K.gguf \
  --port 8099 -c 32768 -t 12 -ngl 99 \
  --parallel 2 --cache-type-k q8_0 --cache-type-v q8_0
```

Greedy decoding (`temperature=0.0, top_p=1.0`) is recommended for code tasks.

## Citation

If you use this model or the OmniMergeKit recipes in your work:

```
@misc{mannix2026microcoder,
  title  = {Qwen3.5-4B-MicroCoder: a task-arithmetic merge for code},
  author = {Mannix, F.},
  year   = {2026},
  url    = {https://huggingface.co/ManniX-ITA/Qwen3.5-4B-MicroCoder},
  note   = {Built with OmniMergeKit, https://github.com/mann1x/omnimergekit}
}
```

## License

Apache 2.0, inherited from Qwen3.5-4B and the source fine-tunes.