Qwen3.6-35B-A3B for hipfire

Pre-quantized Qwen3.6-35B-A3B (MoE, 35B total / 3B activated) for hipfire, a Rust-native LLM inference engine for AMD RDNA GPUs.

Quantized from Qwen/Qwen3.6-35B-A3B. Architecture: 256 experts top-8, hybrid DeltaNet + Full Attention (3:1), head_dim=256 / partial_rotary_factor=0.25, shared expert, tied embeddings — loaded by hipfire's arch_id=6 path with no engine changes.

What's new in this release — a graded-quant SKU ladder

This release replaces the prior two-file (.mq3/.mq4) drop with a full size→quality ladder, and adds per-expert graded mixed-precision SKUs. The key result: in a heavy-tailed MoE, ~80% of routing contribution comes from the top 20% of experts, so a graded file can put the hot 20% of experts at high precision and the cold tail at low precision and beat a uniform quant of the same size. .mq3 here is such a file — it matches MQ4 quality at MQ3 size.

Every SKU keeps the router + embed/lm_head pinned at Q8F16 (the issue-#171 4-bit-router attractor fix — non-negotiable; a 4-bit router collapses to a structural attractor on agentic prompts, unique-word ratio 14% → 46% with the fix). KLD is measured per-token (q8 KV) against an f32-native oracle of the same model; see KLD-AWQ-instructions.md and the .f32.hfq / kldref/ files in this directory.

Files

KLD = KL-divergence vs the f32 oracle (lower is better). wt2 = WikiText-2 (general), agentic = code/agentic corpus. f32 oracle reference PPL: wt2 5.350, agentic 5.902.

Naming: plain .mqN = uniform N-bit. The +P suffix (filename .mqNp) = promoted — a graded build at that size tier whose hot experts are promoted to higher precision and cold experts demoted, beating/matching the uniform quant of the same size. (Filenames use p not + to stay URL/CLI-safe; cards display "+P".)

File	Experts	Size	wt2 / ag KLD	Min VRAM	Serve	Notes
`qwen3.6-35b-a3b.mq2`	uniform MQ2-Lloyd	11.6 GB	0.238 / 0.466	~14 GB	full	Floor SKU — smallest, coherent but visibly degraded; memory-constrained hosts.
`qwen3.6-35b-a3b.mq3p` ⭐	MQ3+P promoted (hot MQ6 / mid MQ4 / cold MQ2-Lloyd)	17.2 GB	0.0352 / 0.163	~20 GB	decode-optimized¹	MQ4 quality at MQ3 size. Dominates a uniform MQ3 (≈half the KLD).
`qwen3.6-35b-a3b.mq4`	uniform MQ4 + Q8 router (dense-AWQ)	19.7 GB	0.0339 / 0.163 ⚠️	~22 GB	full	⚠️ NOT deployable on gfx11 / gfx1151. The 19.7 GB dense-AWQ build forwards to garbage on RDNA3 dGPU (gfx11) and RDNA3.5 iGPU (gfx1151) — token-loops on a bare factual prompt (2026-06-17 night-2). It is coherent only on gfx12. The KLD 0.0339 / 0.163 was measured on a single arch and does not reflect gfx11/gfx1151 output — the arch-invariant-KLD assumption (see perf table) does NOT hold for dense-AWQ: the dense-AWQ recipe is wrong for an MoE model (AWQ scales for dense layers ≠ per-expert weights → lobotomized off-gfx12). Use `.mq4p` as the robust 4-bit default instead (coherent on all arches).
`qwen3.6-35b-a3b.mq4p` ⭐	MQ4+P promoted (hot MQ6 / mid MQ4 / cold MQ3-Lloyd)	19.8 GB	0.0249 / 0.133	~22 GB	decode-optimized¹	The robust 4-bit default — coherent on gfx11 / gfx12 / gfx1151 (unlike `.mq4`, see above). Beats uniform MQ4 by −26% / −18% KLD at iso-size.
`qwen3.6-35b-a3b.mfp4` ⭐	uniform MFP4-E8 (E8 lattice VQ, 4.25 bpw)	20.2 GB	0.0248 / 0.121	~22 GB	full	E8 vector-quant — beats MQ4+P on both axes at iso-size; the 4-bit quality leader. mfp4e8-gptq final form.
`qwen3.6-35b-a3b.mq5`	uniform MQ5	23.7 GB	0.0191 / 0.106	~26 GB	full	Quality SKU — ~80% of the way from MQ4 to f32.
`qwen3.6-35b-a3b.mq6`	uniform MQ6	27.7 GB	0.0159 / 0.0868	~30 GB	full	Max quality; for when VRAM isn't the constraint.

¹ .mq3p / .mq4p are mixed-precision and serve at full speed in decode; their Lloyd cold tier currently uses a per-token prefill fallback (slower TTFT, no decode-throughput impact). A batched-prefill path for the cold tiers is landing — see Engine requirements below.

Reference / tooling files (not for inference; for KLD measurement + further quantization):

qwen3.6-35b-a3b.f32.hfq — f32-native oracle (138.7 GB). The ground truth all KLD numbers are measured against.
kldref/wt2.kldref.bin, kldref/agentic.kldref.bin — precomputed f32 logit references for the two corpora.
qwen3.6-35b-a3b.imatrix.gguf — activation importance matrix (used for AWQ scaling + the graded hot-set ranking; reusable for GPTQ).

The promoted (`+P`) SKUs

.mq3p and .mq4p are graded mixed-precision — per-expert tiering by REAP importance (union-ranked across both corpora so the hot-set transfers cross-domain), with the hot 20% promoted to MQ6, the warm 30% at MQ4, and the cold 50% demoted to the Lloyd floor (MQ2-Lloyd for .mq3p, MQ3-Lloyd for .mq4p). The win comes from spending bits where routing actually goes: .mq4p beats uniform .mq4 at the same size; .mq3p reaches .mq4 quality at MQ3 size.

Both currently serve at full decode speed; their Lloyd cold tier uses a per-token prefill fallback (slower TTFT only) until the batched-prefill MQ3-Lloyd kernel lands, after which .mq4p is promoted to supersede uniform .mq4 as the default. Treat .mq4p as a preview/eval artifact until then.

Engine requirements

The graded SKUs (.mq3p, .mq4p) and the MQ5/MQ6 expert paths require a hipfire build with the per-expert dtype-tag MoE kernels (the graded-quant release). Uniform .mq2/.mq4 run on current master. Check hipfire --version against the release notes; if .mq3p fails to load with a dtype-dispatch error, your engine predates the graded-MoE kernels.

Performance — per-arch prefill / decode

Measured at hipfire build 05a030ac (branch feat/moe-awq-experts), all boxes on the same commit. prefill = tok/s at pp512 / pp1024 (4096-ctx, warmed); dec = decode tok/s. kv-mode f32, warmed median. Three arches — note gfx11 (dGPU) and gfx1151 (iGPU) are NOT interchangeable: same wave32-WMMA op path, but ~960 GB/s GDDR6 + large L2 vs ~256 GB/s LPDDR5 unified gives very different rooflines.

gfx11 — RX 7900 XTX (RDNA3 dGPU, 24 GB) · gfx12 — R9700 (RDNA4, 32 GB) · gfx1151 — Ryzen AI MAX+ 395 / Strix Halo (RDNA3.5 iGPU, 96 GB carveout)

SKU	gfx11 pp512/1024 · dec	gfx12 pp512/1024 · dec	gfx1151 pp512/1024 · dec
mq2	120 / 119 · 124	102 / 102 · 103	64 / 63 · 64
mq3p	1156 / 1146 · 118	2578 / 2492 · 100	582 / 578 · 61
mq4	1239 / 1218 · 126	2793 / 2686 · 99	615 / 604 · 65
mq4p	1053 / 1049 · 116	2221 / 2182 · 99	525 / 526 · 60
mfp4	808 / 801 · 109	1461 / 1427 · 95	406 / 402 · 58
mq5	— (>24 GB)	96 / 95 · 97	59 / 59 · 60
mq6	— (>24 GB)	OOM (>32 GB)	573 / 566 · 58

Reading the table:

mq6 needs >32 GB — only the gfx1151 96 GB carveout fits it (27.7 GB weights + f32 KV + 248K-vocab logits OOMs on 24/32 GB). gfx11 caps at the 4-bit tier; gfx12 stops below mq6.
mq2 / mq5 run per-token prefill (~60–120 tok/s) — there is no grouped-WMMA kernel for 2-bit / 5-bit experts yet, so they fall back to the decode-path GEMV. The graded tiers (mq3p / mq4 / mq4p) and mfp4-E8 batch via grouped-WMMA on all three arches (mfp4 is NOT per-token).
Prefill is compute-bound and near-optimal (corrected 2026-06-16). The earlier "no-AWQ 2928 gfx11 ceiling" was a lobotomized broken-rotate kernel — not a valid target; discard it. Coherent mq4 prefill (gfx11 1244) is compute-bound on the expert GEMM (~33% of i8-WMMA peak); the FWHT rotate is only ~0.5% of prefill. LDS-tiling does not help (gfx11 compute-bound; gfx12 not X-load-bound). The grouped prefill kernels are already near-optimal.
Decode ships hipGraph default-on (gfx11/gfx12; opt-out HIPFIRE_GRAPH=0): +4–8% over the graph-off table above (e.g. mq4p gfx11 116→125, gfx12 99→103), coherence-validated. Decode is launch- and batch-1-GEMV-bound (weight reads run at 16–31% of peak), NOT at the bandwidth floor — so the next decode lever is batching via MTP (multi-token), not more launch reduction.

KLD is arch-invariant (gfx11 ≡ gfx12 to 6 digits) so it is measured once and applies to every arch; only prefill/decode are per-arch. Caveat (night-2): this holds for the scalar-MagnumQuant and E8 tiers, but NOT for dense-AWQ .mq4 — see below.

Night-2 validation (2026-06-17 — antibleed + re-sweep)

The gfx1151/gfx11 antibleed divorce (commit ab220f25, branch feat/moe-awq-experts). The per-arch tuning now cleanly separates the RDNA3.5 iGPU (gfx1150/51/52, is_rdna3p5) from the RDNA3 dGPU (gfx1100/01/02, is_rdna3_dgpu) — the two were sharing a tuning path despite having very different rooflines (LPDDR5 unified vs GDDR6 + large L2). Capability gates stay on has_wmma_w32 (the wave32-WMMA op path is shared; only the tuning divorces).

The divorce is behavior-preserving: gfx1100 committed-token-ids are byte-identical pre/post (md5 da7b4505) and gfx1151 is byte-identical (md5 122261ca). It fixes 3 off-fleet correctness bugs: a gfx1150 admit-vs-select panic, a gfx1103/1152 wrong-reject, and a gfx1152 launch under-cover.

Coherence re-validated post-divorce on a bare FACTUAL prompt (code prompts mask the dense-AWQ lobotomy — use a plain factual prompt to expose it): .mq4p and .mfp4 PASS on gfx11 + gfx12 + gfx1151 (all 8 detectors 0/0). The gfx1151 uniform-HFQ4 kernel was proven innocent — a known-good plain uniform mq4 decodes coherent on gfx1151, so the .mq4 garbage is the dense-AWQ recipe, not the kernel.

Production-bench deployable perf (bench_qwen35_mq4, production forward_prefill_batch, q8 KV, fresh-process median-of-3, hipGraph default-on) — pp512 prefill / decode tok/s:

SKU	gfx11 (RX 7900 XTX)	gfx12 (R9700)	gfx1151 (Strix Halo)
mq4p	2494 / 112.6	2248 / 48.5	992 / 55.8
mfp4	1498 (E8-batched) / 105.5	1484 / 93.1	622 / 54.4

NOTE on the numbers: these production-bench absolutes run ~2.4× higher than the per-arch table above — this is a measurement-method difference, NOT a real gain. The divorce itself is perf-neutral (verified becc0610 == ab220f25). The original per-arch table above remains the consistent-method baseline; treat these two tables as different rulers, not before/after.

Deployable trio (night-2 verdict): .mq4p (graded HQ — robust 4-bit default) + .mfp4 (E8 quality leader). .mq2 / .mq3p remain available. .mq4 (dense-AWQ) is NOT gfx11/gfx1151-safe — gfx12-only.

Usage

# Install hipfire
curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash

# Pull the model (defaults to .mq4, the safe uniform default)
hipfire pull qwen3.6:35b-a3b
hipfire run qwen3.6:35b-a3b "Write a Rust function that parses an ISO-8601 date."

# Pull a specific SKU explicitly
hf download schuttdev/hipfire-qwen3.6-35b-a3b qwen3.6-35b-a3b.mq3p --local-dir ~/.hipfire/models

Configuration notes

thinking:off recommended — A3B is a heavy thinker; default thinking-mode prompts produce long reasoning chains that can loop on complex tasks. hipfire config qwen3.6:35b-a3b set thinking off.
dflash_mode: auto — speculative decoding stays off for A3B unless a cask_sidecar is configured (drafts reject most non-math tokens, τ≈1.0–1.5).
Greedy + RP=1.05 is the safest sampler for this model (issue-#171 matrix).

Quantization format

All formats are FWHT-rotated (incoherence processing) with an 8-byte affine or fp16-codebook group header at group size 256.

MagnumQuant ladder — MQ2 (2.25 B/w) · MQ3 (3.25) · MQ4 (4.25) · MQ5 (5.25) · MQ6 (6.25). MQ2/MQ3 here use the Lloyd codebook variant (optimal scalar quant on the rotated weights).
MFP4-E8 (.mfp4, 4.25 bpw) — E8-lattice vector quantization, a different axis from the scalar MagnumQuant ladder. Rather than quantize each rotated weight independently, it quantizes in 8-dimensional blocks onto the E8 lattice (the densest 8-D sphere packing) with an E4M3 (FP8) per-block scale; the container is the +P mfp4 frame (no codebook prefix) with each 32-element group's nibbles replaced by 32-bit E8 codewords. Exploiting cross-channel correlation the scalar quant can't see, at iso-size (4.25 bpw) it beats both uniform MQ4 and graded MQ4+P on KLD (0.0248 / 0.121) — the 4-bit quality leader. It decodes through the same grouped-WMMA expert path as the MagnumQuant tiers (a per-block lattice→f16 unpack feeding the WMMA GEMM), so it batches in prefill on all three arches (it is not a per-token format). The decode cost of the lattice unpack is hidden behind the load-bound expert GEMM, so decode tok/s is on par with the scalar 4-bit tiers.
Graded mixed-precision — per-expert tiering by REAP importance (a per-(layer,expert) gate × ‖output‖ contribution measure). The hot-set is a union ranking across both corpora (general ∪ agentic) so it transfers cross-domain. .mq3p = top-20% experts → MQ6, next-30% → MQ4, cold-50% → MQ2-Lloyd, applied to both gate_up and down projections. .mq4p uses MQ3-Lloyd as the cold tier instead.
Router/embed/lm_head are pinned at Q8F16 in every SKU (issue-#171).

See KLD-AWQ-instructions.md (this directory) for the full measurement + (re)quantization methodology, including imatrix/AWQ and GPTQ.

Quantized on AMD MI300X; validated on the RDNA3/3.5/4 fleet (gfx1100 / gfx1151 / gfx1201). hipfire is a Rust-native, no-Python-hot-path inference engine for consumer + datacenter AMD GPUs.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for schuttdev/hipfire-qwen3.6-35b-a3b

Base model

Qwen/Qwen3.6-35B-A3B

Finetuned

(159)

this model