GGUF
English
rocmfp4
qwen3next
qwen3-coder-next
coder
Mixture of Experts
imatrix
strix-halo
amd
rocm
vulkan
conversational
Instructions to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF", filename="Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with llama.cpp:
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh # Start a local OpenAI-compatible server with a web UI: llama serve -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF # Run inference directly in the terminal: llama cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama serve -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF # Run inference directly in the terminal: llama cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF # Run inference directly in the terminal: ./llama-cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF # Run inference directly in the terminal: ./build/bin/llama-cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
Use Docker
docker model run hf.co/plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
- LM Studio
- Jan
- Ollama
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Ollama:
ollama run hf.co/plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
- Unsloth Studio
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF to start chatting
- Pi
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
Run Hermes
hermes
- Atomic Chat new
- OpenClaw new
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with OpenClaw:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
Configure OpenClaw
# Install OpenClaw: npm install -g openclaw@latest # Register the local server and set it as the default model: openclaw onboard --non-interactive --mode local \ --auth-choice custom-api-key \ --custom-base-url http://127.0.0.1:8080/v1 \ --custom-model-id "plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF" \ --custom-provider-id llama-cpp \ --custom-compatibility openai \ --custom-text-input \ --accept-risk \ --skip-health
Run OpenClaw
openclaw agent --local --agent main --message "Hello from Hugging Face"
- Docker Model Runner
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Docker Model Runner:
docker model run hf.co/plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
- Lemonade
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
Run and chat with the model
lemonade run user.Qwen3-Coder-Next-ROCmFP4-GGUF-{{QUANT_TAG}}List all available models
lemonade list
| base_model: Qwen/Qwen3-Coder-Next | |
| license: apache-2.0 | |
| library_name: gguf | |
| tags: | |
| - gguf | |
| - rocmfp4 | |
| - qwen3next | |
| - qwen3-coder-next | |
| - coder | |
| - moe | |
| - imatrix | |
| - strix-halo | |
| - amd | |
| - rocm | |
| - vulkan | |
| language: | |
| - en | |
| base_model_relation: quantized | |
| <div style="border:2px solid currentColor; font-family:ui-monospace,'SF Mono','Cascadia Mono',Consolas,'Liberation Mono',monospace;"> | |
| <div style="border-bottom:1px solid currentColor; padding:6px 12px; font-size:11px; letter-spacing:3px; text-transform:uppercase; opacity:0.7; text-align:center;">PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO Β· gfx1151</div> | |
| <div style="padding:14px; display:flex; flex-wrap:wrap; align-items:center; justify-content:center; gap:18px;"> | |
| <pre style="margin:0; flex:0 0 auto; font-family:ui-monospace,'SF Mono','Cascadia Mono',Consolas,monospace; font-size:5px; line-height:1.1; letter-spacing:0;"> | |
| βββββββββ | |
| βββββββββββ | |
| ββ ββββββββββββββββββ | |
| ββ ββββββββββββββββββββ | |
| βββββββ ββββββββββββββββββββββ | |
| ββββ ββ ββββββββββββββββββββββ | |
| ββββββ ββ βββ | |
| βββββββ βββββββββββββββββββββββ | |
| βββββββ ββββββββββββββ ββ | |
| βββββββ ββββββββββββ β ββ | |
| ββββββββ ββββββββββ βββ ββ | |
| ββββββββ ββββββββ βββββββββββ | |
| βββββββββ βββββ βββββββββββββ | |
| βββββββββββ ββ βββββββββββββ | |
| βββββββ ββ βββββββββββββ | |
| ββββ ββ ββββββββ | |
| βββββββββββββ βββββββ | |
| βββ βββββββ | |
| βββββββββ | |
| </pre> | |
| <div style="flex:0 1 auto; max-width:100%; text-align:center;"> | |
| <div style="font-size:23px; font-weight:800; letter-spacing:1px;">QWEN3-CODER-NEXT</div> | |
| <div style="font-size:12.5px; letter-spacing:1px; opacity:0.8; margin-top:5px;"><span style="white-space:nowrap;">4-BIT ROCmFP4</span> Β· <span style="white-space:nowrap;">80B-A3B MoE</span> Β· <span style="white-space:nowrap;">CODE-WEIGHTED IMATRIX</span> Β· <span style="white-space:nowrap;">AGENTIC CODER</span> Β· <span style="white-space:nowrap;">SINGLE AMD APU</span></div> | |
| </div> | |
| </div> | |
| <table style="display:table; table-layout:fixed; width:100%; margin:0; border-collapse:collapse; border-radius:0; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px;"> | |
| <tr> | |
| <td style="border-top:1px solid currentColor; border-right:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">FORMAT</div><div style="font-weight:700;">ROCmFP4 4-BIT</div></td> | |
| <td style="border-top:1px solid currentColor; border-right:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">PRECISION</div><div style="font-weight:700;">~4.5 BPW</div></td> | |
| <td style="border-top:1px solid currentColor; border-right:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">ARCH</div><div style="font-weight:700;">QWEN3NEXT</div></td> | |
| <td style="border-top:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">CONTEXT</div><div style="font-weight:700;">262 K</div></td> | |
| </tr> | |
| <tr> | |
| <td style="border-top:1px solid currentColor; border-right:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">PARAMS</div><div style="font-weight:700;">80B Β· A3B MoE</div></td> | |
| <td style="border-top:1px solid currentColor; border-right:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">DRAFT</div><div style="font-weight:700;">NO MTP</div></td> | |
| <td style="border-top:1px solid currentColor; border-right:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">BACKEND</div><div style="font-weight:700;">VULKAN0</div></td> | |
| <td style="border-top:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">LICENSE</div><div style="font-weight:700;">APACHE-2.0</div></td> | |
| </tr> | |
| </table> | |
| </div> | |
| <div style="border:2px solid #dc2626; padding:10px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12.5px; margin:14px 0;"> | |
| <b style="color:#dc2626; letter-spacing:1px;">β REQUIRES THE ROCmFP4 FORK</b><br> | |
| The custom <code>q4_0_rocmfp4</code> / <code>q4_0_rocmfp4_fast</code> tensor types <b>will not load in stock llama.cpp, LM Studio, or Ollama</b>. Build/run with <a href="https://github.com/charlie12345/ROCmFPX">charlie12345/ROCmFPX</a> Β· branch <code>mtp-rocmfp4-strix</code>. | |
| </div> | |
| <div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:14px 0; opacity:0.85;"> | |
| <b>NOTE //</b> Ignore HuggingFace's auto-detected "F16"/16-bit badge β its parser can't read ROCmFP4 and mislabels the file. These are <b>~4.5 bpw 4-bit</b> ROCmFP4 files; pick by filename in <i>Files and versions</i>. | |
| </div> | |
| Experimental **AMD Strix Halo (gfx1151)** quant of [**Qwen3-Coder-Next**](https://huggingface.co/Qwen/Qwen3-Coder-Next) β Qwen's agentic coding model (**80B total / 3B active** high-sparsity MoE, hybrid Gated-DeltaNet attention, arch `qwen3next`, 262K context) β in the custom **ROCmFP4** 4-bit format, **imatrix-quantized** with a code-weighted importance matrix. | |
| <div style="font-family:ui-monospace,'SF Mono',Consolas,monospace; font-weight:800; font-size:14px; letter-spacing:2px; text-transform:uppercase; border-bottom:2px solid currentColor; padding-bottom:5px; margin:26px 0 12px;"><span style="color:#ea580c;">01</span> Β· FILES</div> | |
| <div style="overflow:hidden; border-radius:0;"> | |
| <table style="width:100%; border-collapse:collapse; border-radius:0; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12.5px;"> | |
| <thead><tr> | |
| <th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">File</th> | |
| <th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">Output head</th> | |
| <th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">Pick if</th> | |
| </tr></thead> | |
| <tbody> | |
| <tr><td style="border:1px solid currentColor; padding:7px 10px;"><code>β¦-STRIX-embQ8-imatrix-headQ6.gguf</code> β </td><td style="border:1px solid currentColor; padding:7px 10px;">Q6_K</td><td style="border:1px solid currentColor; padding:7px 10px;"><b>the one build</b> β best speed/quality balance: Q8 embeddings + Q6 output head on the fast single-scale body</td></tr> | |
| </tbody> | |
| </table> | |
| </div> | |
| One file β the **best speed/quality balance** in ROCmFP4 for Strix Halo. It keeps the two quality levers that are actually *felt* β **Q8 token embeddings** (matching the Q8 source exactly) and a **Q6_K output head** β on the fast single-scale `q4_0_rocmfp4_fast` body + a code-weighted imatrix. Not the most faithful possible (see the fidelity link in Β§04) β it's the point where speed and quality meet best. The DeltaNet-specific tensors (`ssm_conv1d`, `ssm_a`, norms, router) stay **F32**; MoE experts + attention/SSM projections are 4-bit ROCmFP4. | |
| <div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:12px 0; opacity:0.85;"> | |
| <b>NOTE //</b> <b>Q8 embeddings</b> (not f16): the source is Q8_0, so Q8 matches its precision exactly β f16 would be fake-f16 bloat for zero gain (embeddings are a lookup, not a matmul). | |
| </div> | |
| <div style="font-family:ui-monospace,'SF Mono',Consolas,monospace; font-weight:800; font-size:14px; letter-spacing:2px; text-transform:uppercase; border-bottom:2px solid currentColor; padding-bottom:5px; margin:26px 0 12px;"><span style="color:#ea580c;">02</span> Β· QUICK START</div> | |
| Run from the folder holding the `.gguf` (the Qwen ChatML template is baked in β just pass `--jinja`): | |
| ```bash | |
| env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \ | |
| llama-server \ | |
| -m Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf \ | |
| --alias coder-next \ | |
| --host 0.0.0.0 \ | |
| --port 8080 \ | |
| -c 262144 \ | |
| -ctk q8_0 \ | |
| -ctv q8_0 \ | |
| --temp 0.7 \ | |
| --top-p 0.8 \ | |
| --top-k 20 \ | |
| -dev Vulkan0 \ | |
| -ngl 999 \ | |
| -fa on \ | |
| -b 2048 \ | |
| -ub 256 \ | |
| -t 16 \ | |
| -tb 16 \ | |
| -cpent 256 \ | |
| -ctxcp 32 \ | |
| --cache-reuse 256 \ | |
| --cache-ram 65536 \ | |
| --jinja \ | |
| --parallel 1 \ | |
| --metrics \ | |
| --no-mmap | |
| ``` | |
| <div style="overflow:hidden; border-radius:0;"> | |
| <table style="width:100%; border-collapse:collapse; border-radius:0; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px;"> | |
| <thead><tr> | |
| <th style="border:1px solid currentColor; padding:6px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px; width:40%;">Flag</th> | |
| <th style="border:1px solid currentColor; padding:6px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">Function</th> | |
| </tr></thead> | |
| <tbody> | |
| <tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>HSA_OVERRIDE_GFX_VERSION=11.5.1</code></td><td style="border:1px solid currentColor; padding:6px 10px;">treat the APU as gfx1151 (Strix Halo)</td></tr> | |
| <tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>GGML_HIP_ENABLE_UNIFIED_MEMORY=1</code></td><td style="border:1px solid currentColor; padding:6px 10px;">allow use of the full 128 GB unified memory</td></tr> | |
| <tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>-dev Vulkan0</code></td><td style="border:1px solid currentColor; padding:6px 10px;">run on Vulkan β fastest backend for ROCmFP4 on Strix Halo</td></tr> | |
| <tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>-ngl 999 Β· -fa on</code></td><td style="border:1px solid currentColor; padding:6px 10px;">offload all layers Β· flash attention</td></tr> | |
| <tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>-c 262144</code></td><td style="border:1px solid currentColor; padding:6px 10px;">context length (256K)</td></tr> | |
| <tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>-b 2048 Β· -ub 256 Β· -t/-tb 16</code></td><td style="border:1px solid currentColor; padding:6px 10px;">prefill batch / micro-batch Β· CPU threads</td></tr> | |
| <tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>-ctk q8_0 Β· -ctv q8_0</code></td><td style="border:1px solid currentColor; padding:6px 10px;">q8_0 (8-bit) KV cache β how we run it; drop to <code>q4_0</code> to use less memory, or raise to <code>f16</code></td></tr> | |
| <tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>-cpent Β· -ctxcp Β· --cache-reuse Β· --cache-ram 65536</code></td><td style="border:1px solid currentColor; padding:6px 10px;">cross-turn KV checkpointing + 64 GB resident reuse cache</td></tr> | |
| <tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>--temp 0.7 --top-p 0.8 --top-k 20</code></td><td style="border:1px solid currentColor; padding:6px 10px;">Qwen-Coder recommended sampling</td></tr> | |
| <tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>--jinja --parallel 1 --metrics --no-mmap</code></td><td style="border:1px solid currentColor; padding:6px 10px;">apply baked ChatML template Β· single slot Β· metrics Β· weights in RAM</td></tr> | |
| </tbody> | |
| </table> | |
| </div> | |
| <div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:12px 0; opacity:0.85;"> | |
| <b>NOTE //</b> No <code>--spec-*</code> / <code>--spec-type draft-mtp</code> flags β this arch has <b>no MTP head</b> (see Β§04). It's already fast on its own. | |
| </div> | |
| <div style="font-family:ui-monospace,'SF Mono',Consolas,monospace; font-weight:800; font-size:14px; letter-spacing:2px; text-transform:uppercase; border-bottom:2px solid currentColor; padding-bottom:5px; margin:26px 0 12px;"><span style="color:#ea580c;">03</span> Β· AGENTIC CODING / TOOLS</div> | |
| Qwen3-Coder-Next is an **agentic coder** β built to call tools, not narrate code. To wire it up: | |
| - **Chat template:** Qwen (ChatML) is baked into the GGUF β just pass `--jinja` and your client applies it automatically. | |
| - **Tool calling:** enable the **`qwen3_coder`** tool-call parser in your client (e.g. the matching parser flag in llama-server / your agent harness). Without it, native tool calls won't be parsed and the model tends to narrate code instead of calling tools. | |
| - **Sampling:** temp `0.7`, top-p `0.8`, top-k `20` (Qwen-Coder recommended) β already set in Β§02. | |
| <div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:12px 0; opacity:0.85;"> | |
| <b>NOTE //</b> The cross-turn reuse cache (<code>--cache-reuse</code> / <code>--cache-ram</code>) keeps long agentic sessions cheap β the leading prompt isn't re-prefilled every turn. | |
| </div> | |
| <div style="font-family:ui-monospace,'SF Mono',Consolas,monospace; font-weight:800; font-size:14px; letter-spacing:2px; text-transform:uppercase; border-bottom:2px solid currentColor; padding-bottom:5px; margin:26px 0 12px;"><span style="color:#ea580c;">04</span> Β· PERFORMANCE & QUALITY</div> | |
| <div style="overflow:hidden; border-radius:0;"> | |
| <table style="width:100%; border-collapse:collapse; border-radius:0; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12.5px;"> | |
| <tbody> | |
| <tr><td style="border:1px solid currentColor; padding:8px 11px; width:42%;">DECODE Β· short context</td><td style="border:1px solid currentColor; padding:8px 11px; font-weight:700;">~54 t/s (Vulkan / Ryzen AI Max+ 395)</td></tr> | |
| <tr><td style="border:1px solid currentColor; padding:8px 11px;">SPECULATIVE DECODE</td><td style="border:1px solid currentColor; padding:8px 11px; font-weight:700;">none (no MTP head)</td></tr> | |
| <tr><td style="border:1px solid currentColor; padding:8px 11px;">LONG CONTEXT</td><td style="border:1px solid currentColor; padding:8px 11px;">cheap β DeltaNet near-constant memory</td></tr> | |
| <tr><td style="border:1px solid currentColor; padding:8px 11px;">QUANTIZATION</td><td style="border:1px solid currentColor; padding:8px 11px;">fast single-scale body + Q8 emb + Q6 head + code-weighted imatrix (measured win β below)</td></tr> | |
| </tbody> | |
| </table> | |
| </div> | |
| **This is the best speed/quality balance in ROCmFP4 β by design, not the absolute fastest.** On top of the imatrix + Q8 emb + Q6 head, we swept the body kernel against the Q8 source by **KL divergence** (the right fidelity metric). An all-dual-scale body did edge the fast single-scale body on KL, but the gain sat inside the measurement noise while costing decode speed β so the **fast single-scale body + Q8 embeddings + Q6 head** is the right point, and the one file we ship. | |
| This mirrors the fuller sweep on our [**Qwen3.6-27B sibling**](https://huggingface.co/plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF), where every higher-precision body lever (all-dual-scale, selective Q5/Q6 bumps) bought a KL improvement inside the noise at a real speed cost β and where copying an entire dynamic-quant high-precision allocation onto ROCmFP4 *still* couldn't match a true dynamic K-quant, because FP4 is intrinsically less faithful than Q4_K's 4-bit. The same format limit applies here: within ROCmFP4, fast body + Q8 emb + Q6 head is the optimal balance; for maximum fidelity reach for a dynamic K-quant of the base (box below). *(Directional internal measurements β KL vs Q8 on held-out code; reproduce before citing.)* | |
| <div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:12px 0; opacity:0.9;"> | |
| <b>WANT MAXIMUM FIDELITY INSTEAD OF SPEED?</b> Grab a <b>Q6_K / Q8 dynamic GGUF of the base</b> from <a href="https://huggingface.co/Qwen/Qwen3-Coder-Next"><b>Qwen/Qwen3-Coder-Next</b></a> β higher-bit GGUFs run on this same fork. We optimize for throughput in ROCmFP4; if you want the last bit of fidelity over speed, that's the one to grab. | |
| </div> | |
| **Fast even without speculative decoding.** 3B active params + linear Gated-DeltaNet attention β ~54 t/s short-context decode on a Ryzen AI Max+ 395 (Vulkan0), and cheap long context. No MTP needed. | |
| <div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:12px 0; opacity:0.85;"> | |
| <b>NOTE // NO MTP</b> Qwen3-Coder-Next ships <b>without</b> an MTP head, and the ROCmFP4 fork currently wires MTP drafting only for the <code>qwen35</code>/<code>qwen35moe</code> archs, <b>not</b> <code>qwen3next</code>. So these are <b>no-MTP</b> (non-speculative) builds β in practice it doesn't matter, it's fast on its own. | |
| </div> | |
| **The imatrix β code-weighted, and measured (a clean win here).** Quantized **with an importance matrix** built from a **code-weighted** calibration mix (~2.6:1 code:general): real multi-language source + code-analysis prompts from [`eaddario/imatrix-calibration`](https://huggingface.co/datasets/eaddario/imatrix-calibration), plus Kalomaze's `groups_merged` (via [`froggeric/imatrix`](https://huggingface.co/datasets/froggeric/imatrix)) for general. | |
| KL-divergence + perplexity vs the **Q8 reference** on a **held-out code** slice (disjoint from calibration), imatrix vs no-imatrix: | |
| <div style="overflow:hidden; border-radius:0;"> | |
| <table style="width:100%; border-collapse:collapse; border-radius:0; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12.5px;"> | |
| <thead><tr> | |
| <th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">Metric (vs Q8, held-out code)</th> | |
| <th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">No-imatrix</th> | |
| <th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">Imatrix</th> | |
| <th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">Change</th> | |
| </tr></thead> | |
| <tbody> | |
| <tr><td style="border:1px solid currentColor; padding:7px 10px;"><b>Median KLD</b></td><td style="border:1px solid currentColor; padding:7px 10px;">0.00597</td><td style="border:1px solid currentColor; padding:7px 10px; font-weight:700;">0.00478</td><td style="border:1px solid currentColor; padding:7px 10px; font-weight:700;">β20%</td></tr> | |
| <tr><td style="border:1px solid currentColor; padding:7px 10px;">90th-pct KLD</td><td style="border:1px solid currentColor; padding:7px 10px;">0.1342</td><td style="border:1px solid currentColor; padding:7px 10px;">0.1083</td><td style="border:1px solid currentColor; padding:7px 10px;">β19%</td></tr> | |
| <tr><td style="border:1px solid currentColor; padding:7px 10px;"><b>RMS Ξp</b></td><td style="border:1px solid currentColor; padding:7px 10px;">8.14%</td><td style="border:1px solid currentColor; padding:7px 10px; font-weight:700;">7.36%</td><td style="border:1px solid currentColor; padding:7px 10px; font-weight:700;">β10%</td></tr> | |
| <tr><td style="border:1px solid currentColor; padding:7px 10px;"><b>Same top token as Q8</b></td><td style="border:1px solid currentColor; padding:7px 10px;">91.01%</td><td style="border:1px solid currentColor; padding:7px 10px; font-weight:700;">91.49%</td><td style="border:1px solid currentColor; padding:7px 10px; font-weight:700;">+0.48 pp</td></tr> | |
| <tr><td style="border:1px solid currentColor; padding:7px 10px;">Mean PPL</td><td style="border:1px solid currentColor; padding:7px 10px;">3.4556</td><td style="border:1px solid currentColor; padding:7px 10px;">3.4686</td><td style="border:1px solid currentColor; padding:7px 10px;">+0.013 (within Β±0.077 noise β a wash)</td></tr> | |
| </tbody> | |
| </table> | |
| </div> | |
| So the imatrix **measurably improves quantization fidelity to the full model on code** (median KL **β20%**, the gold-standard metric), at **zero cost** (same size/speed). PPL is a statistical wash. Honest scope: this is a fidelity-vs-Q8 measurement on ~20 K tokens of held-out code, **not** an absolute coding benchmark. | |
| <div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:12px 0; opacity:0.85;"> | |
| <b>NOTE //</b> On "dual imatrix": a plain merge of two imatrices is mathematically identical to concatenating the corpora at the same ratio β the only real lever is the code:general ratio, which is what's set here. True size-decoupled balancing would need normalized-merge tooling; not used. | |
| </div> | |
| <div style="font-family:ui-monospace,'SF Mono',Consolas,monospace; font-weight:800; font-size:14px; letter-spacing:2px; text-transform:uppercase; border-bottom:2px solid currentColor; padding-bottom:5px; margin:26px 0 12px;"><span style="color:#ea580c;">05</span> Β· BUILD (REPRODUCIBLE)</div> | |
| ```bash | |
| # code-weighted imatrix on the Q8 (single pass; ratio = the real lever) | |
| llama-imatrix -m Qwen3-Coder-Next-Q8_0.gguf -f code-weighted-calib.txt -o coder-next.imatrix -c 512 -ngl 999 | |
| # quant -> ROCmFP4 with the imatrix (Q8 embeddings) + Q6 output head β the β file (Β§01) | |
| # fast single-scale body; --output-tensor-type q6_K raises the output head to Q6_K | |
| llama-quantize --allow-requantize --token-embedding-type q8_0 --output-tensor-type q6_K --imatrix coder-next.imatrix \ | |
| Qwen3-Coder-Next-Q8_0.gguf Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf Q4_0_ROCMFP4_STRIX | |
| ``` | |
| > Experimental research build for AMD Strix Halo β hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution. | |
| <div style="font-family:ui-monospace,'SF Mono',Consolas,monospace; font-weight:800; font-size:14px; letter-spacing:2px; text-transform:uppercase; border-bottom:2px solid currentColor; padding-bottom:5px; margin:26px 0 12px;"><span style="color:#ea580c;">06</span> Β· LINEAGE & CREDITS</div> | |
| <div style="overflow:hidden; border-radius:0;"> | |
| <table style="width:100%; border-collapse:collapse; border-radius:0; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12.5px;"> | |
| <tbody> | |
| <tr><td style="border:1px solid currentColor; padding:8px 11px; width:26%;">BASE MODEL</td><td style="border:1px solid currentColor; padding:8px 11px;"><a href="https://huggingface.co/Qwen/Qwen3-Coder-Next">Qwen/Qwen3-Coder-Next</a> (Apache-2.0, Qwen team) Β· 80B-A3B MoE, arch <code>qwen3next</code></td></tr> | |
| <tr><td style="border:1px solid currentColor; padding:8px 11px;">CALIBRATION</td><td style="border:1px solid currentColor; padding:8px 11px;"><a href="https://huggingface.co/datasets/eaddario/imatrix-calibration">eaddario/imatrix-calibration</a> (code) Β· Kalomaze <code>groups_merged</code> via <a href="https://huggingface.co/datasets/froggeric/imatrix">froggeric/imatrix</a> (general)</td></tr> | |
| <tr><td style="border:1px solid currentColor; padding:8px 11px;">FORMAT + RUNTIME</td><td style="border:1px solid currentColor; padding:8px 11px;"><a href="https://github.com/charlie12345/ROCmFPX">charlie12345/ROCmFPX</a> (based on llama.cpp, MIT)</td></tr> | |
| </tbody> | |
| </table> | |
| </div> | |
| *Derivative quantization β verify the base model's license before redistribution / use.* | |