--- base_model: Qwen/Qwen3-Coder-Next license: apache-2.0 library_name: gguf tags: - gguf - rocmfp4 - qwen3next - qwen3-coder-next - coder - moe - imatrix - strix-halo - amd - rocm - vulkan language: - en base_model_relation: quantized ---
PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO · gfx1151
            ▗▇▇▇▇▇▇▇▖                 
           ▗█▘▝██████▖                
          ▗▛   ▝██████▆▆▆▆▆▆▆▆▆▆▅     
         ▟▛    ▗█████████████████▙▖   
   ▄▄▄▄▄▟▛    ▟████████████████████▖  
 ▗██▌    ▚▖   ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘  
▗████▖    ▜▖                    ▗█▘   
▜█████▙    ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙    
 ▜█████▙    ▝████████████▛       ▜▙   
  ▜█████▙    ▝██████████▛    ▃    ▜▙  
   ▀█████▙▖   ▝████████▘    ▟█▙    ▀▙ 
    ▝██████▖   ▝▜█████▘    ▟███▙▂▂▂▂▐█
    ▟███████▖    ▜███▘   ▗███████████▛
   ▟█████████▄    ▜▛    ▗███████████▀ 
  ▝█████▀        ▗▛    ▗██████▀▀▀▀▀▘  
    ▜██▘        ▗▛    ▟█████▛▘        
     ▜█▇▇▇▇▇▇▇▇▇█▖   ▟█████▛          
                ▝█▖ ▟█████▛           
                 ▝███████▀            
QWEN3-CODER-NEXT
4-BIT ROCmFP4 · 80B-A3B MoE · CODE-WEIGHTED IMATRIX · AGENTIC CODER · SINGLE AMD APU
FORMAT
ROCmFP4 4-BIT
PRECISION
~4.5 BPW
ARCH
QWEN3NEXT
CONTEXT
262 K
PARAMS
80B · A3B MoE
DRAFT
NO MTP
BACKEND
VULKAN0
LICENSE
APACHE-2.0
⚠ REQUIRES THE ROCmFP4 FORK
The custom q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, or Ollama. Build/run with charlie12345/ROCmFPX · branch mtp-rocmfp4-strix.
NOTE // Ignore HuggingFace's auto-detected "F16"/16-bit badge — its parser can't read ROCmFP4 and mislabels the file. These are ~4.5 bpw 4-bit ROCmFP4 files; pick by filename in Files and versions.
Experimental **AMD Strix Halo (gfx1151)** quant of [**Qwen3-Coder-Next**](https://huggingface.co/Qwen/Qwen3-Coder-Next) — Qwen's agentic coding model (**80B total / 3B active** high-sparsity MoE, hybrid Gated-DeltaNet attention, arch `qwen3next`, 262K context) — in the custom **ROCmFP4** 4-bit format, **imatrix-quantized** with a code-weighted importance matrix.
01 · FILES
File Output head Pick if
…-STRIX-embQ8-imatrix-headQ6.ggufQ6_Kthe one build — best speed/quality balance: Q8 embeddings + Q6 output head on the fast single-scale body
One file — the **best speed/quality balance** in ROCmFP4 for Strix Halo. It keeps the two quality levers that are actually *felt* — **Q8 token embeddings** (matching the Q8 source exactly) and a **Q6_K output head** — on the fast single-scale `q4_0_rocmfp4_fast` body + a code-weighted imatrix. Not the most faithful possible (see the fidelity link in §04) — it's the point where speed and quality meet best. The DeltaNet-specific tensors (`ssm_conv1d`, `ssm_a`, norms, router) stay **F32**; MoE experts + attention/SSM projections are 4-bit ROCmFP4.
NOTE // Q8 embeddings (not f16): the source is Q8_0, so Q8 matches its precision exactly — f16 would be fake-f16 bloat for zero gain (embeddings are a lookup, not a matmul).
02 · QUICK START
Run from the folder holding the `.gguf` (the Qwen ChatML template is baked in — just pass `--jinja`): ```bash env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \ llama-server \ -m Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf \ --alias coder-next \ --host 0.0.0.0 \ --port 8080 \ -c 262144 \ -ctk q8_0 \ -ctv q8_0 \ --temp 0.7 \ --top-p 0.8 \ --top-k 20 \ -dev Vulkan0 \ -ngl 999 \ -fa on \ -b 2048 \ -ub 256 \ -t 16 \ -tb 16 \ -cpent 256 \ -ctxcp 32 \ --cache-reuse 256 \ --cache-ram 65536 \ --jinja \ --parallel 1 \ --metrics \ --no-mmap ```
Flag Function
HSA_OVERRIDE_GFX_VERSION=11.5.1treat the APU as gfx1151 (Strix Halo)
GGML_HIP_ENABLE_UNIFIED_MEMORY=1allow use of the full 128 GB unified memory
-dev Vulkan0run on Vulkan — fastest backend for ROCmFP4 on Strix Halo
-ngl 999 · -fa onoffload all layers · flash attention
-c 262144context length (256K)
-b 2048 · -ub 256 · -t/-tb 16prefill batch / micro-batch · CPU threads
-ctk q8_0 · -ctv q8_0q8_0 (8-bit) KV cache — how we run it; drop to q4_0 to use less memory, or raise to f16
-cpent · -ctxcp · --cache-reuse · --cache-ram 65536cross-turn KV checkpointing + 64 GB resident reuse cache
--temp 0.7 --top-p 0.8 --top-k 20Qwen-Coder recommended sampling
--jinja --parallel 1 --metrics --no-mmapapply baked ChatML template · single slot · metrics · weights in RAM
NOTE // No --spec-* / --spec-type draft-mtp flags — this arch has no MTP head (see §04). It's already fast on its own.
03 · AGENTIC CODING / TOOLS
Qwen3-Coder-Next is an **agentic coder** — built to call tools, not narrate code. To wire it up: - **Chat template:** Qwen (ChatML) is baked into the GGUF — just pass `--jinja` and your client applies it automatically. - **Tool calling:** enable the **`qwen3_coder`** tool-call parser in your client (e.g. the matching parser flag in llama-server / your agent harness). Without it, native tool calls won't be parsed and the model tends to narrate code instead of calling tools. - **Sampling:** temp `0.7`, top-p `0.8`, top-k `20` (Qwen-Coder recommended) — already set in §02.
NOTE // The cross-turn reuse cache (--cache-reuse / --cache-ram) keeps long agentic sessions cheap — the leading prompt isn't re-prefilled every turn.
04 · PERFORMANCE & QUALITY
DECODE · short context~54 t/s (Vulkan / Ryzen AI Max+ 395)
SPECULATIVE DECODEnone (no MTP head)
LONG CONTEXTcheap — DeltaNet near-constant memory
QUANTIZATIONfast single-scale body + Q8 emb + Q6 head + code-weighted imatrix (measured win — below)
**This is the best speed/quality balance in ROCmFP4 — by design, not the absolute fastest.** On top of the imatrix + Q8 emb + Q6 head, we swept the body kernel against the Q8 source by **KL divergence** (the right fidelity metric). An all-dual-scale body did edge the fast single-scale body on KL, but the gain sat inside the measurement noise while costing decode speed — so the **fast single-scale body + Q8 embeddings + Q6 head** is the right point, and the one file we ship. This mirrors the fuller sweep on our [**Qwen3.6-27B sibling**](https://huggingface.co/plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF), where every higher-precision body lever (all-dual-scale, selective Q5/Q6 bumps) bought a KL improvement inside the noise at a real speed cost — and where copying an entire dynamic-quant high-precision allocation onto ROCmFP4 *still* couldn't match a true dynamic K-quant, because FP4 is intrinsically less faithful than Q4_K's 4-bit. The same format limit applies here: within ROCmFP4, fast body + Q8 emb + Q6 head is the optimal balance; for maximum fidelity reach for a dynamic K-quant of the base (box below). *(Directional internal measurements — KL vs Q8 on held-out code; reproduce before citing.)*
WANT MAXIMUM FIDELITY INSTEAD OF SPEED? Grab a Q6_K / Q8 dynamic GGUF of the base from Qwen/Qwen3-Coder-Next — higher-bit GGUFs run on this same fork. We optimize for throughput in ROCmFP4; if you want the last bit of fidelity over speed, that's the one to grab.
**Fast even without speculative decoding.** 3B active params + linear Gated-DeltaNet attention → ~54 t/s short-context decode on a Ryzen AI Max+ 395 (Vulkan0), and cheap long context. No MTP needed.
NOTE // NO MTP Qwen3-Coder-Next ships without an MTP head, and the ROCmFP4 fork currently wires MTP drafting only for the qwen35/qwen35moe archs, not qwen3next. So these are no-MTP (non-speculative) builds — in practice it doesn't matter, it's fast on its own.
**The imatrix — code-weighted, and measured (a clean win here).** Quantized **with an importance matrix** built from a **code-weighted** calibration mix (~2.6:1 code:general): real multi-language source + code-analysis prompts from [`eaddario/imatrix-calibration`](https://huggingface.co/datasets/eaddario/imatrix-calibration), plus Kalomaze's `groups_merged` (via [`froggeric/imatrix`](https://huggingface.co/datasets/froggeric/imatrix)) for general. KL-divergence + perplexity vs the **Q8 reference** on a **held-out code** slice (disjoint from calibration), imatrix vs no-imatrix:
Metric (vs Q8, held-out code) No-imatrix Imatrix Change
Median KLD0.005970.00478−20%
90th-pct KLD0.13420.1083−19%
RMS Δp8.14%7.36%−10%
Same top token as Q891.01%91.49%+0.48 pp
Mean PPL3.45563.4686+0.013 (within ±0.077 noise — a wash)
So the imatrix **measurably improves quantization fidelity to the full model on code** (median KL **−20%**, the gold-standard metric), at **zero cost** (same size/speed). PPL is a statistical wash. Honest scope: this is a fidelity-vs-Q8 measurement on ~20 K tokens of held-out code, **not** an absolute coding benchmark.
NOTE // On "dual imatrix": a plain merge of two imatrices is mathematically identical to concatenating the corpora at the same ratio — the only real lever is the code:general ratio, which is what's set here. True size-decoupled balancing would need normalized-merge tooling; not used.
05 · BUILD (REPRODUCIBLE)
```bash # code-weighted imatrix on the Q8 (single pass; ratio = the real lever) llama-imatrix -m Qwen3-Coder-Next-Q8_0.gguf -f code-weighted-calib.txt -o coder-next.imatrix -c 512 -ngl 999 # quant -> ROCmFP4 with the imatrix (Q8 embeddings) + Q6 output head — the ★ file (§01) # fast single-scale body; --output-tensor-type q6_K raises the output head to Q6_K llama-quantize --allow-requantize --token-embedding-type q8_0 --output-tensor-type q6_K --imatrix coder-next.imatrix \ Qwen3-Coder-Next-Q8_0.gguf Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf Q4_0_ROCMFP4_STRIX ``` > Experimental research build for AMD Strix Halo — hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.
06 · LINEAGE & CREDITS
BASE MODELQwen/Qwen3-Coder-Next (Apache-2.0, Qwen team) · 80B-A3B MoE, arch qwen3next
CALIBRATIONeaddario/imatrix-calibration (code) · Kalomaze groups_merged via froggeric/imatrix (general)
FORMAT + RUNTIMEcharlie12345/ROCmFPX (based on llama.cpp, MIT)
*Derivative quantization — verify the base model's license before redistribution / use.*