Qwen3.6-35B-A3B-MTP — ROCmFP4 STRIX (f16 embeddings)

Experimental AMD Strix Halo (gfx1151) quant of Qwen3.6-35B-A3B — a mixture-of-experts model (~3B active params) with the built-in MTP / next-token-prediction head — in the custom ROCmFP4 4-bit format. Tuned for high MTP draft acceptance and long-context, multi-turn use.

Requires the ROCmFP4 fork (public) — not stock llama.cpp

This file uses the ROCmFP4 tensor types (q4_0_rocmfp4, q4_0_rocmfp4_fast). Stock llama.cpp, LM Studio, Ollama, Jan, koboldcpp, etc. cannot load it. Build and run it with the public fork charlie12345/rocmfp4-llama:

git clone https://github.com/charlie12345/rocmfp4-llama
cd rocmfp4-llama && git checkout mtp-rocmfp4-strix
env JOBS=16 scripts/build-strix-rocmfp4-mtp.sh

Part 1 — The model

What this is

  • Base: unsloth/Qwen3.6-35B-A3B-MTP-GGUF BF16, pinned at revision 5bc3e238d916f48a861bac2f8a1990a0e9b7e98d. Arch qwen35moe: 41 blocks, 2048 hidden, 256 experts, with the nextn_predict_layers=1 MTP head (blk.40.nextn.*), so self-speculative draft-MTP survives quantization.
  • Format: ROCmFP4 — a 4-bit weight format for AMD using an FP4-derived value codebook plus one (FAST) or two (dual) UE4M3/FP8 scale bytes per 32-weight block. Tensor-aware.
  • This variant (STRIX-embF16): quality-biased STRIX preset + f16 token embeddings (full precision; it's a lookup, so ~zero decode cost). No imatrix.
value
File Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf
Size / bpw 18.8 GB / 4.44 bpw
token_embd F16
MoE router (ffn_gate_inp) F32 (full precision, kept automatically)
experts (ffn_*_exps) q4_0_rocmfp4_fast (custom kernel)
attention K/V (+ fused QKV) q4_0_rocmfp4 (dual-scale)
MTP head preserved (blk.40.nextn.*)

The router stays F32 for free. The quantizer excludes expert-gating tensors from quantization, so routing (which experts each token goes to — a discrete, high-sensitivity decision) keeps full precision automatically, while the experts run on the custom ROCmFP4 kernel.

How it was built (reproducible)

llama-quantize \
  --token-embedding-type f16 \
  Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf \
  Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf \
  Q4_0_ROCMFP4_STRIX

Status & caveats

Experimental research build. Results are hardware-, driver-, model-, and prompt-sensitive, and tuned for AMD Strix Halo — they may not reproduce on other GPUs. This is not native FP4 tensor-core execution. Do not treat these numbers as upstream llama.cpp claims.

Credits & license

  • Base model: Qwen3.6-35B-A3B (Qwen team) — a derivative quantization that inherits the base model's license; verify the original Qwen3.6 terms before redistribution/use.
  • BF16 GGUF source: unsloth/Qwen3.6-35B-A3B-MTP-GGUF @ 5bc3e238d916f48a861bac2f8a1990a0e9b7e98d.
  • ROCmFP4 format & runtime: charlie12345/rocmfp4-llama (based on llama.cpp, MIT).

Part 2 — Making practical use of it

What I observed

Hands-on, on a Framework Desktop / AMD Ryzen AI Max+ 395 (gfx1151, 128 GB unified, ROCm 7.2.0):

  • It runs great on the ROCmFP4 fork78–90 t/s decode, MTP draft acceptance ~0.6–0.95 (content-dependent), coherent, and it loads at full 262144 context with ~92 GB free (the KV footprint is modest). MoE decode is naturally fast (only ~3B params active per token), and the F32 router keeps expert selection clean.
  • The companion 27B dense quant (same recipe) is at plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF.

Run config (highest MTP acceptance on Strix Halo)

Full-precision (f16) KV is the dominant acceptance lever; 128 GB unified affords it (drop to -ctk q8_0 -ctv q8_0 on less memory).

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server -m Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf \
  --alias qwen3.6-35b-a3b-rocmfp4-mtp --host 0.0.0.0 --port 8080 \
  -dev Vulkan0 -ngl 999 -fa on \
  -c 262144 -b 2048 -ub 256 -t 16 -tb 16 \
  -ctk f16 -ctv f16 \
  -cpent 256 -ctxcp 32 --cache-reuse 256 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --presence-penalty 0.0 --repeat-penalty 1.0 \
  --spec-type draft-mtp --spec-draft-device Vulkan0 --spec-draft-ngl all \
  --spec-draft-type-k f16 --spec-draft-type-v f16 \
  --spec-draft-n-max 3 --spec-draft-n-min 0 --spec-draft-p-min 0.0 --spec-draft-p-split 0.10 \
  --reasoning on --reasoning-format deepseek \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --jinja --parallel 1 --metrics --no-mmap

-dev Vulkan0 (KHR_coopmat) beats ROCm here; -ub 256 is the prefill optimum; --spec-type draft-mtp uses the model's built-in MTP head. Temp 0.6 = Qwen3.6 "precise coding" (1.0 for general).

Multi-turn prompt-cache reuse (OpenCode)

Qwen3.6's recurrent state can't partial-rewind, so multi-turn reuse needs a context checkpoint. Two defaults otherwise force a full re-prefill every turn; both are fixed above:

  1. Checkpoints — default -cpent is 8192, so prompts under 8K never checkpoint. Fix: -cpent 256 -ctxcp 32 --cache-reuse 256.
  2. Thinking--reasoning-format deepseek + --chat-template-kwargs '{"preserve_thinking": true}' keeps <think> across turns with clean content+reasoning_content. (none = raw tags inline but works with any content-echoing client; deepseek-legacy/auto do not reuse.)
  3. Vision--mmproj disables cache reuse; keep it off for text/code.

--jinja is required for the chat template + preserve_thinking.

OpenAI-compatible client (e.g. OpenCode)

Point the client at the server. In single-model mode llama-server ignores the request's model field, so the client's model name is just a label.

  • Base URL: http://<host>:8080/v1 · API key: any non-empty string (e.g. sk-local)
  • Model id (what this server reports): qwen3.6-35b-a3b-rocmfp4-mtp

A patched OpenCode that compacts conversation history without invalidating the prompt cache is at PlunderStruck/opencode — pair it with the checkpoint flags to keep long sessions fast.

Downloads last month
-
GGUF
Model size
36B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF

Quantized
(4)
this model