ManniX-ITA posted an update 16 days ago
🚀 Two releases this week pushing merge methodology forward.

▶ Qwen3.6-27B-Omnimerge-v4-MLP
ManniX-ITA/Qwen3.6-27B-Omnimerge-v4

Same-base DARE-TIES merge of Qwen3.6-27B + 3 fine-tunes (rico03 Claude distill, Esper3.1, kai-os Opus reasoning anchor) via my Omnimerge_v2 method (OBIM-lite + DAREx-q + EMR election).
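For anyone new to the technique: DARE randomly drops a fraction of each fine-tune's delta and rescales the survivors, then TIES resolves sign conflicts by majority election before summing. A minimal single-tensor sketch in NumPy; the function and parameter names here are illustrative, not the actual Omnimerge_v2 code:

```python
import numpy as np

def dare_ties(base, finetunes, density=0.7, seed=0):
    """Sketch of DARE-TIES on one tensor (illustrative, not Omnimerge_v2)."""
    rng = np.random.default_rng(seed)
    deltas = []
    for ft in finetunes:
        d = ft - base                          # task vector vs shared base
        mask = rng.random(d.shape) < density   # DARE: random keep mask
        d = np.where(mask, d / density, 0.0)   # rescale survivors by 1/density
        deltas.append(d)
    stacked = np.stack(deltas)
    sign = np.sign(stacked.sum(axis=0))        # TIES: elect majority sign
    agree = np.where(np.sign(stacked) == sign, stacked, 0.0)
    counts = (agree != 0).sum(axis=0)
    merged_delta = agree.sum(axis=0) / np.maximum(counts, 1)
    return base + merged_delta
```

The sign election is what keeps one fine-tune's negative delta from cancelling another's positive one on the same parameter.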

Hit a Qwen3.6-specific fragility: hyperparams that work flawlessly on 3.5 produced 80% unclosed-<think> on 3.6, collapsing pass@1 to ~20%. Per-tensor delta forensics localized the failure to mlp.{gate,up,down}_proj in layers 27–52. Fix: MLP-passthrough surgery — copy MLPs verbatim from base, keep merged attn + linear_attn. Leak → 0%.
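The surgery itself is plain state-dict filtering. A hedged sketch, assuming the standard HF Qwen key layout (`model.layers.<i>.mlp.*`); the helper name is mine:

```python
def mlp_passthrough(merged_sd, base_sd, first=27, last=52):
    """Overwrite merged MLP projections in layers [first, last] with base tensors."""
    for name, tensor in base_sd.items():
        parts = name.split(".")
        if "mlp" not in parts:
            continue                    # keep merged attention untouched
        try:
            layer = int(parts[parts.index("layers") + 1])
        except (ValueError, IndexError):
            continue                    # key without a layer index
        if first <= layer <= last:
            merged_sd[name] = tensor    # copy the MLP verbatim from base
    return merged_sd
```

Because only MLP keys in the affected range are touched, the merged attention and linear_attn weights survive unchanged.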

Q6_K results (vs Qwen3.6 base / vs Omnimerge-v2 on Qwen3.5):
• HumanEval: 84.76% (= base, +5.49 pp vs v2)
• MBPP corrected: 73.40% (+15.80 pp vs base, ≈ v2)
• GPQA Diamond: ~84.75% partial 192/198 (+15.5 pp vs v2)

▶ Qwen3.5-4B Importance-Signal Study (M1..M5)

Controlled 5-way comparison: same Qwen3.5-4B base, same 2 fine-tunes (Jackrong Claude-4.5 distill + Crow Opus-4.6 distill), only the importance signal driving DARE-TIES sparsification varies.

Q6_K HE / MBPP pass@1:
• M1 Vanilla DARE-TIES → 51.22 / 47.00
• M2 OMv2 (no signal) → 52.44 / 49.40
• M3 OMv2 + Fisher → 57.93 🥇 / 48.80
• M4 mergekit ex-LRP (PR #682) → 51.22 / 49.40
• M5 OMv2 + LRP → 53.05 / 51.40 🥇
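Mechanically, an importance signal lets sparsification keep the delta entries that matter instead of dropping at random. This is my assumed reading of the mechanism, not the actual M2-M5 code; a one-tensor NumPy sketch:

```python
import numpy as np

def importance_sparsify(delta, importance, density=0.7):
    """Keep the top-`density` fraction of entries ranked by an importance
    signal (e.g. Fisher or LRP scores), zero the rest, and rescale the
    survivors as in DARE to preserve the expected delta magnitude.
    Illustrative sketch, not Omnimerge_v2 / mergekit code."""
    k = max(1, int(density * delta.size))
    thresh = np.partition(importance.ravel(), -k)[-k]   # k-th largest score
    mask = importance >= thresh
    return np.where(mask, delta / density, 0.0)
```

Swapping this deterministic mask in for DARE's random one is the only variable the M1..M5 grid changes, which is what makes the comparison controlled.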

Findings: Fisher wins HE (+4.88 pp over vanilla), LRP wins MBPP (+2.60 pp). Both signals + Omnimerge_v2 recipe beat vanilla. To make multimodal-LM ex-LRP work end-to-end against Qwen3_5ForConditionalGeneration, I filed 5 patches against arcee-ai/mergekit PR #682 + 1 against rachtibat/lxt.
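For context, a Fisher importance signal is typically the diagonal of the empirical Fisher information: the per-parameter mean of squared log-likelihood gradients. A toy illustration on logistic regression rather than an LM (names are mine, assumptions labeled):

```python
import numpy as np

def fisher_diag_logreg(w, X, y):
    """Empirical Fisher diagonal for logistic regression: mean over examples
    of the squared per-example gradient of the log-likelihood w.r.t. w.
    Toy stand-in for the per-tensor Fisher signal used in the study."""
    p = 1.0 / (1.0 + np.exp(-X @ w))   # sigmoid predictions per example
    g = (p - y)[:, None] * X           # per-example gradient rows
    return (g ** 2).mean(axis=0)       # one importance score per weight
```

For an LM the same recipe runs over log-probabilities of sampled tokens, giving one score per parameter that can then rank delta entries.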

All five Mx checkpoints + Fisher/LRP signal safetensors + reproducer scripts published.

🔄 Follow-up to the M1..M5 study above — re-ran M4 against the updated PR #682 head ("turbo" branch by @Tusm11, HEAD 8d989f6) with rebalanced hyperparams (equal weights w=1/1, density 0.7, closer to the PR's worked example). Same AttnLRP signal as M4-orig, same sources.

▶ Qwen3.5-4B-M4-v2-ex-LRP-turbo
ManniX-ITA/Qwen3.5-4B-M4-v2-ex-LRP-turbo

Q6_K HE / MBPP pass@1, M4-v2 inserted:

• M1 Vanilla DARE-TIES → 51.22 / 47.00
• M2 OMv2 (no signal) → 52.44 / 49.40
• M3 OMv2 + Fisher → 57.93 🥇 / 48.80
• M4 ex-LRP (PR #682 orig) → 51.22 / 49.40
• M4-v2 ex-LRP (PR #682 turbo) → 55.49 / 52.20 🥇
• M5 OMv2 + LRP → 53.05 / 51.40

Δ M4-v2 vs M4-orig: +4.27 pp HE, +2.80 pp MBPP. M4-v2 takes the MBPP medal of the whole study (overtakes M5) while staying competitive on HumanEval. The turbo code path + rebalanced hyperparams clearly beat the original PR head on this configuration.

Findings refresh: Fisher still leads HE; ex-LRP (turbo) now leads MBPP, narrowly ahead of OMv2+LRP. Both LRP variants land within 1 pp on MBPP — strong signal that LRP-driven sparsification is doing real work for code-gen on small Qwen merges.

Big thanks to @Tusm11 for the supercharged ex-LRP turbo head — multimodal support + Iron-Man stabilization + in-place math are a real upgrade. Posted full results + the 6 patches needed to run it against Qwen3_5ForConditionalGeneration on the PR thread:
https://github.com/arcee-ai/mergekit/pull/682
