Instructions to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF",
	filename="Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
# Run inference directly in the terminal:
llama cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
# Run inference directly in the terminal:
llama cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
# Run inference directly in the terminal:
./llama-cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
# Run inference directly in the terminal:
./build/bin/llama-cli -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Use Docker

docker model run hf.co/plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

LM Studio
Jan
Ollama
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Ollama:
```
ollama run hf.co/plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
```

Unsloth Studio

How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF to start chatting

How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Run Hermes

hermes

Atomic Chat new

OpenClaw new

How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Docker Model Runner:
```
docker model run hf.co/plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF
```

Lemonade

How to use plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull plunderstruck/Qwen3-Coder-Next-ROCmFP4-GGUF

Run and chat with the model

lemonade run user.Qwen3-Coder-Next-ROCmFP4-GGUF-{{QUANT_TAG}}

List all available models

lemonade list

Qwen3-Coder-Next-ROCmFP4-GGUF / README.md

plunderstruck

Repoint build instructions to charlie12345/ROCmFPX (ROCmFPX FP3/4/6/8 repo)

9acfd2c verified 12 days ago

preview code

Raw

History Blame Contribute Delete

24.4 kB

	---
	base_model: Qwen/Qwen3-Coder-Next
	license: apache-2.0
	library_name: gguf
	tags:
	- gguf
	- rocmfp4
	- qwen3next
	- qwen3-coder-next
	- coder
	- moe
	- imatrix
	- strix-halo
	- amd
	- rocm
	- vulkan
	language:
	- en
	base_model_relation: quantized
	---

	<div style="border:2px solid currentColor; font-family:ui-monospace,'SF Mono','Cascadia Mono',Consolas,'Liberation Mono',monospace;">
	<div style="border-bottom:1px solid currentColor; padding:6px 12px; font-size:11px; letter-spacing:3px; text-transform:uppercase; opacity:0.7; text-align:center;">PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO · gfx1151</div>
	<div style="padding:14px; display:flex; flex-wrap:wrap; align-items:center; justify-content:center; gap:18px;">
	<pre style="margin:0; flex:0 0 auto; font-family:ui-monospace,'SF Mono','Cascadia Mono',Consolas,monospace; font-size:5px; line-height:1.1; letter-spacing:0;">
	▗▇▇▇▇▇▇▇▖
	▗█▘▝██████▖
	▗▛ ▝██████▆▆▆▆▆▆▆▆▆▆▅
	▟▛ ▗█████████████████▙▖
	▄▄▄▄▄▟▛ ▟████████████████████▖
	▗██▌ ▚▖ ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘
	▗████▖ ▜▖ ▗█▘
	▜█████▙ ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙
	▜█████▙ ▝████████████▛ ▜▙
	▜█████▙ ▝██████████▛ ▃ ▜▙
	▀█████▙▖ ▝████████▘ ▟█▙ ▀▙
	▝██████▖ ▝▜█████▘ ▟███▙▂▂▂▂▐█
	▟███████▖ ▜███▘ ▗███████████▛
	▟█████████▄ ▜▛ ▗███████████▀
	▝█████▀ ▗▛ ▗██████▀▀▀▀▀▘
	▜██▘ ▗▛ ▟█████▛▘
	▜█▇▇▇▇▇▇▇▇▇█▖ ▟█████▛
	▝█▖ ▟█████▛
	▝███████▀
	</pre>
	<div style="flex:0 1 auto; max-width:100%; text-align:center;">
	<div style="font-size:23px; font-weight:800; letter-spacing:1px;">QWEN3-CODER-NEXT</div>
	<div style="font-size:12.5px; letter-spacing:1px; opacity:0.8; margin-top:5px;"><span style="white-space:nowrap;">4-BIT ROCmFP4</span> · <span style="white-space:nowrap;">80B-A3B MoE</span> · <span style="white-space:nowrap;">CODE-WEIGHTED IMATRIX</span> · <span style="white-space:nowrap;">AGENTIC CODER</span> · <span style="white-space:nowrap;">SINGLE AMD APU</span></div>
	</div>
	</div>
	<table style="display:table; table-layout:fixed; width:100%; margin:0; border-collapse:collapse; border-radius:0; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px;">
	<tr>
	<td style="border-top:1px solid currentColor; border-right:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">FORMAT</div><div style="font-weight:700;">ROCmFP4 4-BIT</div></td>
	<td style="border-top:1px solid currentColor; border-right:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">PRECISION</div><div style="font-weight:700;">~4.5 BPW</div></td>
	<td style="border-top:1px solid currentColor; border-right:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">ARCH</div><div style="font-weight:700;">QWEN3NEXT</div></td>
	<td style="border-top:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">CONTEXT</div><div style="font-weight:700;">262 K</div></td>
	</tr>
	<tr>
	<td style="border-top:1px solid currentColor; border-right:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">PARAMS</div><div style="font-weight:700;">80B · A3B MoE</div></td>
	<td style="border-top:1px solid currentColor; border-right:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">DRAFT</div><div style="font-weight:700;">NO MTP</div></td>
	<td style="border-top:1px solid currentColor; border-right:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">BACKEND</div><div style="font-weight:700;">VULKAN0</div></td>
	<td style="border-top:1px solid currentColor; padding:8px 12px;"><div style="font-size:10px; letter-spacing:1px; opacity:0.6;">LICENSE</div><div style="font-weight:700;">APACHE-2.0</div></td>
	</tr>
	</table>
	</div>

	<div style="border:2px solid #dc2626; padding:10px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12.5px; margin:14px 0;">
	<b style="color:#dc2626; letter-spacing:1px;">⚠ REQUIRES THE ROCmFP4 FORK</b><br>
	The custom <code>q4_0_rocmfp4</code> / <code>q4_0_rocmfp4_fast</code> tensor types <b>will not load in stock llama.cpp, LM Studio, or Ollama</b>. Build/run with <a href="https://github.com/charlie12345/ROCmFPX">charlie12345/ROCmFPX</a> · branch <code>mtp-rocmfp4-strix</code>.
	</div>

	<div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:14px 0; opacity:0.85;">
	<b>NOTE //</b> Ignore HuggingFace's auto-detected "F16"/16-bit badge — its parser can't read ROCmFP4 and mislabels the file. These are <b>~4.5 bpw 4-bit</b> ROCmFP4 files; pick by filename in <i>Files and versions</i>.
	</div>

	Experimental AMD Strix Halo (gfx1151) quant of [Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) — Qwen's agentic coding model (80B total / 3B active high-sparsity MoE, hybrid Gated-DeltaNet attention, arch `qwen3next`, 262K context) — in the custom ROCmFP4 4-bit format, imatrix-quantized with a code-weighted importance matrix.

	<div style="font-family:ui-monospace,'SF Mono',Consolas,monospace; font-weight:800; font-size:14px; letter-spacing:2px; text-transform:uppercase; border-bottom:2px solid currentColor; padding-bottom:5px; margin:26px 0 12px;"><span style="color:#ea580c;">01</span> · FILES</div>

	<div style="overflow:hidden; border-radius:0;">
	<table style="width:100%; border-collapse:collapse; border-radius:0; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12.5px;">
	<thead><tr>
	<th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">File</th>
	<th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">Output head</th>
	<th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">Pick if</th>
	</tr></thead>
	<tbody>
	<tr><td style="border:1px solid currentColor; padding:7px 10px;"><code>…-STRIX-embQ8-imatrix-headQ6.gguf</code> ★</td><td style="border:1px solid currentColor; padding:7px 10px;">Q6_K</td><td style="border:1px solid currentColor; padding:7px 10px;"><b>the one build</b> — best speed/quality balance: Q8 embeddings + Q6 output head on the fast single-scale body</td></tr>
	</tbody>
	</table>
	</div>

	One file — the best speed/quality balance in ROCmFP4 for Strix Halo. It keeps the two quality levers that are actually felt — Q8 token embeddings (matching the Q8 source exactly) and a Q6_K output head — on the fast single-scale `q4_0_rocmfp4_fast` body + a code-weighted imatrix. Not the most faithful possible (see the fidelity link in §04) — it's the point where speed and quality meet best. The DeltaNet-specific tensors (`ssm_conv1d`, `ssm_a`, norms, router) stay F32; MoE experts + attention/SSM projections are 4-bit ROCmFP4.

	<div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:12px 0; opacity:0.85;">
	<b>NOTE //</b> <b>Q8 embeddings</b> (not f16): the source is Q8_0, so Q8 matches its precision exactly — f16 would be fake-f16 bloat for zero gain (embeddings are a lookup, not a matmul).
	</div>

	<div style="font-family:ui-monospace,'SF Mono',Consolas,monospace; font-weight:800; font-size:14px; letter-spacing:2px; text-transform:uppercase; border-bottom:2px solid currentColor; padding-bottom:5px; margin:26px 0 12px;"><span style="color:#ea580c;">02</span> · QUICK START</div>

	Run from the folder holding the `.gguf` (the Qwen ChatML template is baked in — just pass `--jinja`):

	```bash
	env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
	llama-server \
	-m Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf \
	--alias coder-next \
	--host 0.0.0.0 \
	--port 8080 \
	-c 262144 \
	-ctk q8_0 \
	-ctv q8_0 \
	--temp 0.7 \
	--top-p 0.8 \
	--top-k 20 \
	-dev Vulkan0 \
	-ngl 999 \
	-fa on \
	-b 2048 \
	-ub 256 \
	-t 16 \
	-tb 16 \
	-cpent 256 \
	-ctxcp 32 \
	--cache-reuse 256 \
	--cache-ram 65536 \
	--jinja \
	--parallel 1 \
	--metrics \
	--no-mmap
	```

	<div style="overflow:hidden; border-radius:0;">
	<table style="width:100%; border-collapse:collapse; border-radius:0; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px;">
	<thead><tr>
	<th style="border:1px solid currentColor; padding:6px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px; width:40%;">Flag</th>
	<th style="border:1px solid currentColor; padding:6px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">Function</th>
	</tr></thead>
	<tbody>
	<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>HSA_OVERRIDE_GFX_VERSION=11.5.1</code></td><td style="border:1px solid currentColor; padding:6px 10px;">treat the APU as gfx1151 (Strix Halo)</td></tr>
	<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>GGML_HIP_ENABLE_UNIFIED_MEMORY=1</code></td><td style="border:1px solid currentColor; padding:6px 10px;">allow use of the full 128 GB unified memory</td></tr>
	<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>-dev Vulkan0</code></td><td style="border:1px solid currentColor; padding:6px 10px;">run on Vulkan — fastest backend for ROCmFP4 on Strix Halo</td></tr>
	<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>-ngl 999 · -fa on</code></td><td style="border:1px solid currentColor; padding:6px 10px;">offload all layers · flash attention</td></tr>
	<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>-c 262144</code></td><td style="border:1px solid currentColor; padding:6px 10px;">context length (256K)</td></tr>
	<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>-b 2048 · -ub 256 · -t/-tb 16</code></td><td style="border:1px solid currentColor; padding:6px 10px;">prefill batch / micro-batch · CPU threads</td></tr>
	<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>-ctk q8_0 · -ctv q8_0</code></td><td style="border:1px solid currentColor; padding:6px 10px;">q8_0 (8-bit) KV cache — how we run it; drop to <code>q4_0</code> to use less memory, or raise to <code>f16</code></td></tr>
	<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>-cpent · -ctxcp · --cache-reuse · --cache-ram 65536</code></td><td style="border:1px solid currentColor; padding:6px 10px;">cross-turn KV checkpointing + 64 GB resident reuse cache</td></tr>
	<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>--temp 0.7 --top-p 0.8 --top-k 20</code></td><td style="border:1px solid currentColor; padding:6px 10px;">Qwen-Coder recommended sampling</td></tr>
	<tr><td style="border:1px solid currentColor; padding:6px 10px;"><code>--jinja --parallel 1 --metrics --no-mmap</code></td><td style="border:1px solid currentColor; padding:6px 10px;">apply baked ChatML template · single slot · metrics · weights in RAM</td></tr>
	</tbody>
	</table>
	</div>

	<div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:12px 0; opacity:0.85;">
	<b>NOTE //</b> No <code>--spec-*</code> / <code>--spec-type draft-mtp</code> flags — this arch has <b>no MTP head</b> (see §04). It's already fast on its own.
	</div>

	<div style="font-family:ui-monospace,'SF Mono',Consolas,monospace; font-weight:800; font-size:14px; letter-spacing:2px; text-transform:uppercase; border-bottom:2px solid currentColor; padding-bottom:5px; margin:26px 0 12px;"><span style="color:#ea580c;">03</span> · AGENTIC CODING / TOOLS</div>

	Qwen3-Coder-Next is an agentic coder — built to call tools, not narrate code. To wire it up:

	- Chat template: Qwen (ChatML) is baked into the GGUF — just pass `--jinja` and your client applies it automatically.
	- Tool calling: enable the `qwen3_coder` tool-call parser in your client (e.g. the matching parser flag in llama-server / your agent harness). Without it, native tool calls won't be parsed and the model tends to narrate code instead of calling tools.
	- Sampling: temp `0.7`, top-p `0.8`, top-k `20` (Qwen-Coder recommended) — already set in §02.

	<div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:12px 0; opacity:0.85;">
	<b>NOTE //</b> The cross-turn reuse cache (<code>--cache-reuse</code> / <code>--cache-ram</code>) keeps long agentic sessions cheap — the leading prompt isn't re-prefilled every turn.
	</div>

	<div style="font-family:ui-monospace,'SF Mono',Consolas,monospace; font-weight:800; font-size:14px; letter-spacing:2px; text-transform:uppercase; border-bottom:2px solid currentColor; padding-bottom:5px; margin:26px 0 12px;"><span style="color:#ea580c;">04</span> · PERFORMANCE & QUALITY</div>

	<div style="overflow:hidden; border-radius:0;">
	<table style="width:100%; border-collapse:collapse; border-radius:0; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12.5px;">
	<tbody>
	<tr><td style="border:1px solid currentColor; padding:8px 11px; width:42%;">DECODE · short context</td><td style="border:1px solid currentColor; padding:8px 11px; font-weight:700;">~54 t/s (Vulkan / Ryzen AI Max+ 395)</td></tr>
	<tr><td style="border:1px solid currentColor; padding:8px 11px;">SPECULATIVE DECODE</td><td style="border:1px solid currentColor; padding:8px 11px; font-weight:700;">none (no MTP head)</td></tr>
	<tr><td style="border:1px solid currentColor; padding:8px 11px;">LONG CONTEXT</td><td style="border:1px solid currentColor; padding:8px 11px;">cheap — DeltaNet near-constant memory</td></tr>
	<tr><td style="border:1px solid currentColor; padding:8px 11px;">QUANTIZATION</td><td style="border:1px solid currentColor; padding:8px 11px;">fast single-scale body + Q8 emb + Q6 head + code-weighted imatrix (measured win — below)</td></tr>
	</tbody>
	</table>
	</div>

	This is the best speed/quality balance in ROCmFP4 — by design, not the absolute fastest. On top of the imatrix + Q8 emb + Q6 head, we swept the body kernel against the Q8 source by KL divergence (the right fidelity metric). An all-dual-scale body did edge the fast single-scale body on KL, but the gain sat inside the measurement noise while costing decode speed — so the fast single-scale body + Q8 embeddings + Q6 head is the right point, and the one file we ship.

	This mirrors the fuller sweep on our [Qwen3.6-27B sibling](https://huggingface.co/plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF), where every higher-precision body lever (all-dual-scale, selective Q5/Q6 bumps) bought a KL improvement inside the noise at a real speed cost — and where copying an entire dynamic-quant high-precision allocation onto ROCmFP4 still couldn't match a true dynamic K-quant, because FP4 is intrinsically less faithful than Q4_K's 4-bit. The same format limit applies here: within ROCmFP4, fast body + Q8 emb + Q6 head is the optimal balance; for maximum fidelity reach for a dynamic K-quant of the base (box below). (Directional internal measurements — KL vs Q8 on held-out code; reproduce before citing.)

	<div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:12px 0; opacity:0.9;">
	<b>WANT MAXIMUM FIDELITY INSTEAD OF SPEED?</b> Grab a <b>Q6_K / Q8 dynamic GGUF of the base</b> from <a href="https://huggingface.co/Qwen/Qwen3-Coder-Next"><b>Qwen/Qwen3-Coder-Next</b></a> — higher-bit GGUFs run on this same fork. We optimize for throughput in ROCmFP4; if you want the last bit of fidelity over speed, that's the one to grab.
	</div>

	Fast even without speculative decoding. 3B active params + linear Gated-DeltaNet attention → ~54 t/s short-context decode on a Ryzen AI Max+ 395 (Vulkan0), and cheap long context. No MTP needed.

	<div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:12px 0; opacity:0.85;">
	<b>NOTE // NO MTP</b> Qwen3-Coder-Next ships <b>without</b> an MTP head, and the ROCmFP4 fork currently wires MTP drafting only for the <code>qwen35</code>/<code>qwen35moe</code> archs, <b>not</b> <code>qwen3next</code>. So these are <b>no-MTP</b> (non-speculative) builds — in practice it doesn't matter, it's fast on its own.
	</div>

	The imatrix — code-weighted, and measured (a clean win here). Quantized with an importance matrix built from a code-weighted calibration mix (~2.6:1 code:general): real multi-language source + code-analysis prompts from [`eaddario/imatrix-calibration`](https://huggingface.co/datasets/eaddario/imatrix-calibration), plus Kalomaze's `groups_merged` (via [`froggeric/imatrix`](https://huggingface.co/datasets/froggeric/imatrix)) for general.

	KL-divergence + perplexity vs the Q8 reference on a held-out code slice (disjoint from calibration), imatrix vs no-imatrix:

	<div style="overflow:hidden; border-radius:0;">
	<table style="width:100%; border-collapse:collapse; border-radius:0; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12.5px;">
	<thead><tr>
	<th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">Metric (vs Q8, held-out code)</th>
	<th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">No-imatrix</th>
	<th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">Imatrix</th>
	<th style="border:1px solid currentColor; padding:7px 10px; text-align:left; text-transform:uppercase; font-size:10px; letter-spacing:1px;">Change</th>
	</tr></thead>
	<tbody>
	<tr><td style="border:1px solid currentColor; padding:7px 10px;"><b>Median KLD</b></td><td style="border:1px solid currentColor; padding:7px 10px;">0.00597</td><td style="border:1px solid currentColor; padding:7px 10px; font-weight:700;">0.00478</td><td style="border:1px solid currentColor; padding:7px 10px; font-weight:700;">−20%</td></tr>
	<tr><td style="border:1px solid currentColor; padding:7px 10px;">90th-pct KLD</td><td style="border:1px solid currentColor; padding:7px 10px;">0.1342</td><td style="border:1px solid currentColor; padding:7px 10px;">0.1083</td><td style="border:1px solid currentColor; padding:7px 10px;">−19%</td></tr>
	<tr><td style="border:1px solid currentColor; padding:7px 10px;"><b>RMS Δp</b></td><td style="border:1px solid currentColor; padding:7px 10px;">8.14%</td><td style="border:1px solid currentColor; padding:7px 10px; font-weight:700;">7.36%</td><td style="border:1px solid currentColor; padding:7px 10px; font-weight:700;">−10%</td></tr>
	<tr><td style="border:1px solid currentColor; padding:7px 10px;"><b>Same top token as Q8</b></td><td style="border:1px solid currentColor; padding:7px 10px;">91.01%</td><td style="border:1px solid currentColor; padding:7px 10px; font-weight:700;">91.49%</td><td style="border:1px solid currentColor; padding:7px 10px; font-weight:700;">+0.48 pp</td></tr>
	<tr><td style="border:1px solid currentColor; padding:7px 10px;">Mean PPL</td><td style="border:1px solid currentColor; padding:7px 10px;">3.4556</td><td style="border:1px solid currentColor; padding:7px 10px;">3.4686</td><td style="border:1px solid currentColor; padding:7px 10px;">+0.013 (within ±0.077 noise — a wash)</td></tr>
	</tbody>
	</table>
	</div>

	So the imatrix measurably improves quantization fidelity to the full model on code (median KL −20%, the gold-standard metric), at zero cost (same size/speed). PPL is a statistical wash. Honest scope: this is a fidelity-vs-Q8 measurement on ~20 K tokens of held-out code, not an absolute coding benchmark.

	<div style="border:1px solid currentColor; padding:8px 13px; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12px; margin:12px 0; opacity:0.85;">
	<b>NOTE //</b> On "dual imatrix": a plain merge of two imatrices is mathematically identical to concatenating the corpora at the same ratio — the only real lever is the code:general ratio, which is what's set here. True size-decoupled balancing would need normalized-merge tooling; not used.
	</div>

	<div style="font-family:ui-monospace,'SF Mono',Consolas,monospace; font-weight:800; font-size:14px; letter-spacing:2px; text-transform:uppercase; border-bottom:2px solid currentColor; padding-bottom:5px; margin:26px 0 12px;"><span style="color:#ea580c;">05</span> · BUILD (REPRODUCIBLE)</div>

	```bash
	# code-weighted imatrix on the Q8 (single pass; ratio = the real lever)
	llama-imatrix -m Qwen3-Coder-Next-Q8_0.gguf -f code-weighted-calib.txt -o coder-next.imatrix -c 512 -ngl 999

	# quant -> ROCmFP4 with the imatrix (Q8 embeddings) + Q6 output head — the ★ file (§01)
	# fast single-scale body; --output-tensor-type q6_K raises the output head to Q6_K
	llama-quantize --allow-requantize --token-embedding-type q8_0 --output-tensor-type q6_K --imatrix coder-next.imatrix \
	Qwen3-Coder-Next-Q8_0.gguf Qwen3-Coder-Next-ROCmFP4-STRIX-embQ8-imatrix-headQ6.gguf Q4_0_ROCMFP4_STRIX
	```

	> Experimental research build for AMD Strix Halo — hardware/driver/prompt-sensitive, may not reproduce elsewhere. Not native FP4 tensor-core execution.

	<div style="font-family:ui-monospace,'SF Mono',Consolas,monospace; font-weight:800; font-size:14px; letter-spacing:2px; text-transform:uppercase; border-bottom:2px solid currentColor; padding-bottom:5px; margin:26px 0 12px;"><span style="color:#ea580c;">06</span> · LINEAGE & CREDITS</div>

	<div style="overflow:hidden; border-radius:0;">
	<table style="width:100%; border-collapse:collapse; border-radius:0; font-family:ui-monospace,'SF Mono',Consolas,monospace; font-size:12.5px;">
	<tbody>
	<tr><td style="border:1px solid currentColor; padding:8px 11px; width:26%;">BASE MODEL</td><td style="border:1px solid currentColor; padding:8px 11px;"><a href="https://huggingface.co/Qwen/Qwen3-Coder-Next">Qwen/Qwen3-Coder-Next</a> (Apache-2.0, Qwen team) · 80B-A3B MoE, arch <code>qwen3next</code></td></tr>
	<tr><td style="border:1px solid currentColor; padding:8px 11px;">CALIBRATION</td><td style="border:1px solid currentColor; padding:8px 11px;"><a href="https://huggingface.co/datasets/eaddario/imatrix-calibration">eaddario/imatrix-calibration</a> (code) · Kalomaze <code>groups_merged</code> via <a href="https://huggingface.co/datasets/froggeric/imatrix">froggeric/imatrix</a> (general)</td></tr>
	<tr><td style="border:1px solid currentColor; padding:8px 11px;">FORMAT + RUNTIME</td><td style="border:1px solid currentColor; padding:8px 11px;"><a href="https://github.com/charlie12345/ROCmFPX">charlie12345/ROCmFPX</a> (based on llama.cpp, MIT)</td></tr>
	</tbody>
	</table>
	</div>

	Derivative quantization — verify the base model's license before redistribution / use.