plunderstruck's picture
Repoint build instructions to charlie12345/ROCmFPX (ROCmFPX FP3/4/6/8 repo)
9acfd2c verified
|
Raw
History Blame Contribute Delete
24.4 kB
---
base_model: Qwen/Qwen3-Coder-Next
license: apache-2.0
library_name: gguf
tags:
- gguf
- rocmfp4
- qwen3next
- qwen3-coder-next
- coder
- moe
- imatrix
- strix-halo
- amd
- rocm
- vulkan
language:
- en
base_model_relation: quantized
---
<div style="border:2px solid currentColor; font-family:ui-monospace,'SF Mono','Cascadia Mono',Consolas,'Liberation Mono',monospace;">
<div style="border-bottom:1px solid currentColor; padding:6px 12px; font-size:11px; letter-spacing:3px; text-transform:uppercase; opacity:0.7; text-align:center;">PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO Β· gfx1151</div>
<div style="padding:14px; display:flex; flex-wrap:wrap; align-items:center; justify-content:center; gap:18px;">
<pre style="margin:0; flex:0 0 auto; font-family:ui-monospace,'SF Mono','Cascadia Mono',Consolas,monospace; font-size:5px; line-height:1.1; letter-spacing:0;">
β–—β–‡β–‡β–‡β–‡β–‡β–‡β–‡β––
β–—β–ˆβ–˜β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ––
β–—β–› β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–†β–†β–†β–†β–†β–†β–†β–†β–…
β–Ÿβ–› β–—β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–™β––
β–„β–„β–„β–„β–„β–Ÿβ–› β–Ÿβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ––
β–—β–ˆβ–ˆβ–Œ β–šβ–– β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–”β–ˆβ–˜
β–—β–ˆβ–ˆβ–ˆβ–ˆβ–– β–œβ–– β–—β–ˆβ–˜
β–œβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–™ β–œβ–†β–†β–†β–†β–†β–†β–†β–†β–†β–†β–†β–†β–†β–†β–†β–€β–€β–€β–€β–€β–œβ–™
β–œβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–™ β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–› β–œβ–™
β–œβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–™ β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–› β–ƒ β–œβ–™
β–€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–™β–– β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–˜ β–Ÿβ–ˆβ–™ β–€β–™
β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–– β–β–œβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–˜ β–Ÿβ–ˆβ–ˆβ–ˆβ–™β–‚β–‚β–‚β–‚β–β–ˆ
β–Ÿβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–– β–œβ–ˆβ–ˆβ–ˆβ–˜ β–—β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–›
β–Ÿβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–„ β–œβ–› β–—β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–€
β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–€ β–—β–› β–—β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–€β–€β–€β–€β–€β–˜
β–œβ–ˆβ–ˆβ–˜ β–—β–› β–Ÿβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–›β–˜
β–œβ–ˆβ–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–ˆβ–– β–Ÿβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–›
β–β–ˆβ–– β–Ÿβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–›
β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–€
</pre>
<div style="flex:0 1 auto; max-width:100%; text-align:center;">
<div style="font-size:23px; font-weight:800; letter-spacing:1px;">QWEN3-CODER-NEXT</div>
<div style="font-size:12.5px; letter-spacing:1px; opacity:0.8; margin-top:5px;"><span style="white-space:nowrap;">4-BIT ROCmFP4</span> Β· <span style="white-space:nowrap;">80B-A3B MoE</span> Β· <span style="white-space:nowrap;">CODE-WEIGHTED IMATRIX</span> Β· <span style="white-space:nowrap;">AGENTIC CODER</span> Β· <span style="white-space:nowrap;">SINGLE AMD APU</span></div>
</div>
</div>
<table style="display:table; table-layout:fixed; width:100%; margin:0; border-collapse:collapse; border-radius:0; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px;">
<tr>
<td style="border-top:1px solid currentColor; border-right:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">FORMAT</div><div style="font-weight:700;">ROCmFP4 4-BIT</div></td>
<td style="border-top:1px solid currentColor; border-right:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">PRECISION</div><div style="font-weight:700;">~4.5 BPW</div></td>
<td style="border-top:1px solid currentColor; border-right:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">ARCH</div><div style="font-weight:700;">QWEN3NEXT</div></td>
<td style="border-top:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">CONTEXT</div><div style="font-weight:700;">262 K</div></td>
</tr>
<tr>
<td style="border-top:1px solid currentColor; border-right:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">PARAMS</div><div style="font-weight:700;">80B Β· A3B MoE</div></td>
<td style="border-top:1px solid currentColor; border-right:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">DRAFT</div><div style="font-weight:700;">NO MTP</div></td>
<td style="border-top:1px solid currentColor; border-right:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">BACKEND</div><div style="font-weight:700;">VULKAN0</div></td>
<td style="border-top:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">LICENSE</div><div style="font-weight:700;">APACHE-2.0</div></td>
</tr>
</table>
</div>
<div style="border:2px solid #dc2626; padding:10px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12.5px; margin:14px 0;">
<b style="color:#dc2626; letter-spacing:1px;">⚠ REQUIRES THE ROCmFP4 FORK</b><br>
The custom <code>q4_0_rocmfp4</code> / <code>q4_0_rocmfp4_fast</code> tensor types <b>will not load in stock llama.cpp, LM Studio, or Ollama</b>. Build/run with <a href="https://github.com/charlie12345/ROCmFPX">charlie12345/ROCmFPX</a> Β· branch <code>mtp-rocmfp4-strix</code>.
</div>
<div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:14px 0; opacity:0.85;">
<b>NOTE //</b> Ignore HuggingFace's auto-detected "F16"/16-bit badge β€” its parser can't read ROCmFP4 and mislabels the file. These are <b>~4.5 bpw 4-bit</b> ROCmFP4 files; pick by filename in <i>Files and versions</i>.
</div>
Experimental **AMD Strix Halo (gfx1151)** quant of [**Qwen3-Coder-Next**](https://huggingface.co/Qwen/Qwen3-Coder-Next) β€” Qwen's agentic coding model (**80B total / 3B active** high-sparsity MoE, hybrid Gated-DeltaNet attention, arch `qwen3next`, 262K context) β€” in the custom **ROCmFP4** 4-bit format, **imatrix-quantized** with a code-weighted importance matrix.
<div style="font-family:ui-monospace,'SF Mono',Consolas,monospace; font-weight:800; font-size:14px; letter-spacing:2px; text-transform:uppercase; border-bottom:2px solid currentColor; padding-bottom:5px; margin:26px 0 12px;"><span style="color:#ea580c;">01</span> Β· FILES</div>
<div style="overflow:hidden; border-radius:0;">
<table style="width:100%; border-collapse:collapse; border-radius:0; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12.5px;">
<thead><tr>
<th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">File</th>
<th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">Output head</th>
<th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">Pick if</th>
</tr></thead>
<tbody>
<tr><td style="border:1px solid currentColor; padding:7px 10px;"><code>…-STRIX-embQ8-imatrix-headQ6.gguf</code> β˜…</td><td style="border:1px solid currentColor; padding:7px 10px;">Q6_K</td><td style="border:1px solid currentColor; padding:7px 10px;"><b>the one build</b> β€” best speed/quality balance: Q8 embeddings + Q6 output head on the fast single-scale body</td></tr>
</tbody>
</table>
</div>
One file β€” the **best speed/quality balance** in ROCmFP4 for Strix Halo. It keeps the two quality levers that are actually *felt* β€” **Q8 token embeddings** (matching the Q8 source exactly) and a **Q6_K output head** β€” on the fast single-scale `q4_0_rocmfp4_fast` body + a code-weighted imatrix. Not the most faithful possible (see the fidelity link in Β§04) β€” it's the point where speed and quality meet best. The DeltaNet-specific tensors (`ssm_conv1d`, `ssm_a`, norms, router) stay **F32**; MoE experts + attention/SSM projections are 4-bit ROCmFP4.
<div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:12px 0; opacity:0.85;">
<b>NOTE //</b> <b>Q8 embeddings</b> (not f16): the source is Q8_0, so Q8 matches its precision exactly β€” f16 would be fake-f16 bloat for zero gain (embeddings are a lookup, not a matmul).
</div>
<div style="font-family:ui-monospace,'SF Mono',Consolas,monospace; font-weight:800; font-size:14px; letter-spacing:2px; text-transform:uppercase; border-bottom:2px solid currentColor; padding-bottom:5px; margin:26px 0 12px;"><span style="color:#ea580c;">02</span> Β· QUICK START</div>
Run from the folder holding the `.gguf` (the Qwen ChatML template is baked in β€” just pass `--jinja`):
```bash
env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
-m Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf \
--alias coder-next \
--host 0.0.0.0 \
--port 8080 \
-c 262144 \
-ctk q8_0 \
-ctv q8_0 \
--temp 0.7 \
--top-p 0.8 \
--top-k 20 \
-dev Vulkan0 \
-ngl 999 \
-fa on \
-b 2048 \
-ub 256 \
-t 16 \
-tb 16 \
-cpent 256 \
-ctxcp 32 \
--cache-reuse 256 \
--cache-ram 65536 \
--jinja \
--parallel 1 \
--metrics \
--no-mmap
```
<div style="overflow:hidden; border-radius:0;">
<table style="width:100%; border-collapse:collapse; border-radius:0; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px;">
<thead><tr>
<th style="border:1px solid currentColor; padding:6px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px; width:40%;">Flag</th>
<th style="border:1px solid currentColor; padding:6px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">Function</th>
</tr></thead>
<tbody>
<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>HSA_OVERRIDE_GFX_VERSION=11.5.1</code></td><td style="border:1px solid currentColor; padding:6px 10px;">treat the APU as gfx1151 (Strix Halo)</td></tr>
<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>GGML_HIP_ENABLE_UNIFIED_MEMORY=1</code></td><td style="border:1px solid currentColor; padding:6px 10px;">allow use of the full 128 GB unified memory</td></tr>
<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>-dev Vulkan0</code></td><td style="border:1px solid currentColor; padding:6px 10px;">run on Vulkan β€” fastest backend for ROCmFP4 on Strix Halo</td></tr>
<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>-ngl 999 Β· -fa on</code></td><td style="border:1px solid currentColor; padding:6px 10px;">offload all layers Β· flash attention</td></tr>
<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>-c 262144</code></td><td style="border:1px solid currentColor; padding:6px 10px;">context length (256K)</td></tr>
<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>-b 2048 Β· -ub 256 Β· -t/-tb 16</code></td><td style="border:1px solid currentColor; padding:6px 10px;">prefill batch / micro-batch Β· CPU threads</td></tr>
<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>-ctk q8_0 Β· -ctv q8_0</code></td><td style="border:1px solid currentColor; padding:6px 10px;">q8_0 (8-bit) KV cache β€” how we run it; drop to <code>q4_0</code> to use less memory, or raise to <code>f16</code></td></tr>
<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>-cpent Β· -ctxcp Β· --cache-reuse Β· --cache-ram 65536</code></td><td style="border:1px solid currentColor; padding:6px 10px;">cross-turn KV checkpointing + 64 GB resident reuse cache</td></tr>
<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>--temp 0.7 --top-p 0.8 --top-k 20</code></td><td style="border:1px solid currentColor; padding:6px 10px;">Qwen-Coder recommended sampling</td></tr>
<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>--jinja --parallel 1 --metrics --no-mmap</code></td><td style="border:1px solid currentColor; padding:6px 10px;">apply baked ChatML template Β· single slot Β· metrics Β· weights in RAM</td></tr>
</tbody>
</table>
</div>
<div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:12px 0; opacity:0.85;">
<b>NOTE //</b> No <code>--spec-*</code> / <code>--spec-type draft-mtp</code> flags β€” this arch has <b>no MTP head</b> (see Β§04). It's already fast on its own.
</div>
<div style="font-family:ui-monospace,'SF Mono',Consolas,monospace; font-weight:800; font-size:14px; letter-spacing:2px; text-transform:uppercase; border-bottom:2px solid currentColor; padding-bottom:5px; margin:26px 0 12px;"><span style="color:#ea580c;">03</span> Β· AGENTIC CODING / TOOLS</div>
Qwen3-Coder-Next is an **agentic coder** β€” built to call tools, not narrate code. To wire it up:
- **Chat template:** Qwen (ChatML) is baked into the GGUF β€” just pass `--jinja` and your client applies it automatically.
- **Tool calling:** enable the **`qwen3_coder`** tool-call parser in your client (e.g. the matching parser flag in llama-server / your agent harness). Without it, native tool calls won't be parsed and the model tends to narrate code instead of calling tools.
- **Sampling:** temp `0.7`, top-p `0.8`, top-k `20` (Qwen-Coder recommended) β€” already set in Β§02.
<div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:12px 0; opacity:0.85;">
<b>NOTE //</b> The cross-turn reuse cache (<code>--cache-reuse</code> / <code>--cache-ram</code>) keeps long agentic sessions cheap β€” the leading prompt isn't re-prefilled every turn.
</div>
<div style="font-family:ui-monospace,'SF Mono',Consolas,monospace; font-weight:800; font-size:14px; letter-spacing:2px; text-transform:uppercase; border-bottom:2px solid currentColor; padding-bottom:5px; margin:26px 0 12px;"><span style="color:#ea580c;">04</span> Β· PERFORMANCE &amp; QUALITY</div>
<div style="overflow:hidden; border-radius:0;">
<table style="width:100%; border-collapse:collapse; border-radius:0; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12.5px;">
<tbody>
<tr><td style="border:1px solid currentColor; padding:8px 11px; width:42%;">DECODE Β· short context</td><td style="border:1px solid currentColor; padding:8px 11px; font-weight:700;">~54 t/s (Vulkan / Ryzen AI Max+ 395)</td></tr>
<tr><td style="border:1px solid currentColor; padding:8px 11px;">SPECULATIVE DECODE</td><td style="border:1px solid currentColor; padding:8px 11px; font-weight:700;">none (no MTP head)</td></tr>
<tr><td style="border:1px solid currentColor; padding:8px 11px;">LONG CONTEXT</td><td style="border:1px solid currentColor; padding:8px 11px;">cheap β€” DeltaNet near-constant memory</td></tr>
<tr><td style="border:1px solid currentColor; padding:8px 11px;">QUANTIZATION</td><td style="border:1px solid currentColor; padding:8px 11px;">fast single-scale body + Q8 emb + Q6 head + code-weighted imatrix (measured win β€” below)</td></tr>
</tbody>
</table>
</div>
**This is the best speed/quality balance in ROCmFP4 β€” by design, not the absolute fastest.** On top of the imatrix + Q8 emb + Q6 head, we swept the body kernel against the Q8 source by **KL divergence** (the right fidelity metric). An all-dual-scale body did edge the fast single-scale body on KL, but the gain sat inside the measurement noise while costing decode speed β€” so the **fast single-scale body + Q8 embeddings + Q6 head** is the right point, and the one file we ship.
This mirrors the fuller sweep on our [**Qwen3.6-27B sibling**](https://huggingface.co/plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF), where every higher-precision body lever (all-dual-scale, selective Q5/Q6 bumps) bought a KL improvement inside the noise at a real speed cost β€” and where copying an entire dynamic-quant high-precision allocation onto ROCmFP4 *still* couldn't match a true dynamic K-quant, because FP4 is intrinsically less faithful than Q4_K's 4-bit. The same format limit applies here: within ROCmFP4, fast body + Q8 emb + Q6 head is the optimal balance; for maximum fidelity reach for a dynamic K-quant of the base (box below). *(Directional internal measurements β€” KL vs Q8 on held-out code; reproduce before citing.)*
<div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:12px 0; opacity:0.9;">
<b>WANT MAXIMUM FIDELITY INSTEAD OF SPEED?</b> Grab a <b>Q6_K / Q8 dynamic GGUF of the base</b> from <a href="https://huggingface.co/Qwen/Qwen3-Coder-Next"><b>Qwen/Qwen3-Coder-Next</b></a> β€” higher-bit GGUFs run on this same fork. We optimize for throughput in ROCmFP4; if you want the last bit of fidelity over speed, that's the one to grab.
</div>
**Fast even without speculative decoding.** 3B active params + linear Gated-DeltaNet attention β†’ ~54 t/s short-context decode on a Ryzen AI Max+ 395 (Vulkan0), and cheap long context. No MTP needed.
<div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:12px 0; opacity:0.85;">
<b>NOTE // NO MTP</b> Qwen3-Coder-Next ships <b>without</b> an MTP head, and the ROCmFP4 fork currently wires MTP drafting only for the <code>qwen35</code>/<code>qwen35moe</code> archs, <b>not</b> <code>qwen3next</code>. So these are <b>no-MTP</b> (non-speculative) builds β€” in practice it doesn't matter, it's fast on its own.
</div>
**The imatrix β€” code-weighted, and measured (a clean win here).** Quantized **with an importance matrix** built from a **code-weighted** calibration mix (~2.6:1 code:general): real multi-language source + code-analysis prompts from [`eaddario/imatrix-calibration`](https://huggingface.co/datasets/eaddario/imatrix-calibration), plus Kalomaze's `groups_merged` (via [`froggeric/imatrix`](https://huggingface.co/datasets/froggeric/imatrix)) for general.
KL-divergence + perplexity vs the **Q8 reference** on a **held-out code** slice (disjoint from calibration), imatrix vs no-imatrix:
<div style="overflow:hidden; border-radius:0;">
<table style="width:100%; border-collapse:collapse; border-radius:0; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12.5px;">
<thead><tr>
<th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">Metric (vs Q8, held-out code)</th>
<th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">No-imatrix</th>
<th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">Imatrix</th>
<th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">Change</th>
</tr></thead>
<tbody>
<tr><td style="border:1px solid currentColor; padding:7px 10px;"><b>Median KLD</b></td><td style="border:1px solid currentColor; padding:7px 10px;">0.00597</td><td style="border:1px solid currentColor; padding:7px 10px; font-weight:700;">0.00478</td><td style="border:1px solid currentColor; padding:7px 10px; font-weight:700;">βˆ’20%</td></tr>
<tr><td style="border:1px solid currentColor; padding:7px 10px;">90th-pct KLD</td><td style="border:1px solid currentColor; padding:7px 10px;">0.1342</td><td style="border:1px solid currentColor; padding:7px 10px;">0.1083</td><td style="border:1px solid currentColor; padding:7px 10px;">βˆ’19%</td></tr>
<tr><td style="border:1px solid currentColor; padding:7px 10px;"><b>RMS Ξ”p</b></td><td style="border:1px solid currentColor; padding:7px 10px;">8.14%</td><td style="border:1px solid currentColor; padding:7px 10px; font-weight:700;">7.36%</td><td style="border:1px solid currentColor; padding:7px 10px; font-weight:700;">βˆ’10%</td></tr>
<tr><td style="border:1px solid currentColor; padding:7px 10px;"><b>Same top token as Q8</b></td><td style="border:1px solid currentColor; padding:7px 10px;">91.01%</td><td style="border:1px solid currentColor; padding:7px 10px; font-weight:700;">91.49%</td><td style="border:1px solid currentColor; padding:7px 10px; font-weight:700;">+0.48 pp</td></tr>
<tr><td style="border:1px solid currentColor; padding:7px 10px;">Mean PPL</td><td style="border:1px solid currentColor; padding:7px 10px;">3.4556</td><td style="border:1px solid currentColor; padding:7px 10px;">3.4686</td><td style="border:1px solid currentColor; padding:7px 10px;">+0.013 (within Β±0.077 noise β€” a wash)</td></tr>
</tbody>
</table>
</div>
So the imatrix **measurably improves quantization fidelity to the full model on code** (median KL **βˆ’20%**, the gold-standard metric), at **zero cost** (same size/speed). PPL is a statistical wash. Honest scope: this is a fidelity-vs-Q8 measurement on ~20 K tokens of held-out code, **not** an absolute coding benchmark.
<div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:12px 0; opacity:0.85;">
<b>NOTE //</b> On "dual imatrix": a plain merge of two imatrices is mathematically identical to concatenating the corpora at the same ratio β€” the only real lever is the code:general ratio, which is what's set here. True size-decoupled balancing would need normalized-merge tooling; not used.
</div>
<div style="font-family:ui-monospace,'SF Mono',Consolas,monospace; font-weight:800; font-size:14px; letter-spacing:2px; text-transform:uppercase; border-bottom:2px solid currentColor; padding-bottom:5px; margin:26px 0 12px;"><span style="color:#ea580c;">05</span> Β· BUILD (REPRODUCIBLE)</div>
```bash
# code-weighted imatrix on the Q8 (single pass; ratio = the real lever)
llama-imatrix -m Qwen3-Coder-Next-Q8_0.gguf -f code-weighted-calib.txt -o coder-next.imatrix -c 512 -ngl 999
# quant -> ROCmFP4 with the imatrix (Q8 embeddings) + Q6 output head β€” the β˜… file (Β§01)
# fast single-scale body; --output-tensor-type q6_K raises the output head to Q6_K
llama-quantize --allow-requantize --token-embedding-type q8_0 --output-tensor-type q6_K --imatrix coder-next.imatrix \
Qwen3-Coder-Next-Q8_0.gguf Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf Q4_0_ROCMFP4_STRIX
```
> Experimental research build for AMD Strix Halo β€” hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.
<div style="font-family:ui-monospace,'SF Mono',Consolas,monospace; font-weight:800; font-size:14px; letter-spacing:2px; text-transform:uppercase; border-bottom:2px solid currentColor; padding-bottom:5px; margin:26px 0 12px;"><span style="color:#ea580c;">06</span> Β· LINEAGE &amp; CREDITS</div>
<div style="overflow:hidden; border-radius:0;">
<table style="width:100%; border-collapse:collapse; border-radius:0; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12.5px;">
<tbody>
<tr><td style="border:1px solid currentColor; padding:8px 11px; width:26%;">BASE MODEL</td><td style="border:1px solid currentColor; padding:8px 11px;"><a href="https://huggingface.co/Qwen/Qwen3-Coder-Next">Qwen/Qwen3-Coder-Next</a> (Apache-2.0, Qwen team) Β· 80B-A3B MoE, arch <code>qwen3next</code></td></tr>
<tr><td style="border:1px solid currentColor; padding:8px 11px;">CALIBRATION</td><td style="border:1px solid currentColor; padding:8px 11px;"><a href="https://huggingface.co/datasets/eaddario/imatrix-calibration">eaddario/imatrix-calibration</a> (code) Β· Kalomaze <code>groups_merged</code> via <a href="https://huggingface.co/datasets/froggeric/imatrix">froggeric/imatrix</a> (general)</td></tr>
<tr><td style="border:1px solid currentColor; padding:8px 11px;">FORMAT + RUNTIME</td><td style="border:1px solid currentColor; padding:8px 11px;"><a href="https://github.com/charlie12345/ROCmFPX">charlie12345/ROCmFPX</a> (based on llama.cpp, MIT)</td></tr>
</tbody>
</table>
</div>
*Derivative quantization β€” verify the base model's license before redistribution / use.*