PLUNDERSTRUCK // ROCmFP4 QUANTIZED MODEL // STRIX HALO · gfx1151
            ▗▇▇▇▇▇▇▇▖                 
           ▗█▘▝██████▖                
          ▗▛   ▝██████▆▆▆▆▆▆▆▆▆▆▅     
         ▟▛    ▗█████████████████▙▖   
   ▄▄▄▄▄▟▛    ▟████████████████████▖  
 ▗██▌    ▚▖   ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔█▘  
▗████▖    ▜▖                    ▗█▘   
▜█████▙    ▜▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▀▀▀▀▀▜▙    
 ▜█████▙    ▝████████████▛       ▜▙   
  ▜█████▙    ▝██████████▛    ▃    ▜▙  
   ▀█████▙▖   ▝████████▘    ▟█▙    ▀▙ 
    ▝██████▖   ▝▜█████▘    ▟███▙▂▂▂▂▐█
    ▟███████▖    ▜███▘   ▗███████████▛
   ▟█████████▄    ▜▛    ▗███████████▀ 
  ▝█████▀        ▗▛    ▗██████▀▀▀▀▀▘  
    ▜██▘        ▗▛    ▟█████▛▘        
     ▜█▇▇▇▇▇▇▇▇▇█▖   ▟█████▛          
                ▝█▖ ▟█████▛           
                 ▝███████▀            
QWEN3.6-35B-A3B-MTP
4-BIT ROCmFP4 · MIXTURE-OF-EXPERTS (A3B) · MTP SELF-SPECULATIVE DECODE · SINGLE AMD APU

    
      FORMAT
ROCmFP4 4-BIT

      PRECISION
4.44 BPW

      SIZE
18.8 GB

      CONTEXT
262 K

    

      ARCH
MoE · 256 EXPERTS

      ACTIVE / HIDDEN
~3B · 2048

      DRAFT
MTP n-max 5

      BACKEND
VULKAN0

    

⚠ REQUIRES THE ROCmFP4 FORK

The custom q4_0_rocmfp4 / q4_0_rocmfp4_fast tensor types will not load in stock llama.cpp, LM Studio, Ollama, Jan, or koboldcpp. Build/run with charlie12345/ROCmFPX · branch mtp-rocmfp4-strix.

NOTE // Ignore HuggingFace's auto-detected "F16" badge — its parser can't read ROCmFP4 and mislabels by the genuinely-f16 token embeddings. These are ~4.44 bpw 4-bit files; pick by filename.

01 · FILES

File	Size	Output head	Pick if
`…-STRIX-embF16-headQ6.gguf` ★	18.8 GB	Q6_K	the one build — best speed/quality balance: f16 embeddings + Q6 output head on the fast single-scale body

One file — the best speed/quality balance in ROCmFP4 for Strix Halo. It keeps the two quality levers that are actually felt — genuine f16 token embeddings (from BF16) and a Q6_K output head — on the fast single-scale q4_0_rocmfp4_fast body, with the F32 MoE router and the MTP head preserved (no imatrix). Not the leanest-fastest possible, and not the most faithful possible (see the Unsloth fidelity link in §03) — it's the point where speed and quality meet best. Vision: this MoE is multimodal (Qwen3.6 is natively VL) — the repo bundles the mmproj-F32.gguf Qwen3-VL projector (projection_dim 2048, matched to the MoE hidden size; verified reading a test image) plus chat_template.jinja (tool calls + think-toggle + vision).

NOTE // The Q6_K output head — the layer that turns the final hidden state into the next-token choice — is raised from 4-bit ROCmFP4 to standard Q6_K in this build. It's the output-side complement to running f16 token embeddings, sharpening both ends of the model; the head/embedding trade-off is characterized in detail on the 27B card.

02 · QUICK START

Run from the folder holding the .gguf + chat_template.jinja:

env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server \
  -m Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf \
  --alias qwen35b-a3b-mtp \
  --host 0.0.0.0 \
  --port 8080 \
  -c 262144 \
  -ctk f16 \
  -ctv f16 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  -dev Vulkan0 \
  -ngl 999 \
  -fa on \
  -b 2048 \
  -ub 256 \
  -t 16 \
  -tb 16 \
  -cpent 256 \
  -ctxcp 32 \
  --cache-reuse 256 \
  --cache-ram 65536 \
  --jinja \
  --parallel 1 \
  --metrics \
  --no-mmap \
  --spec-type draft-mtp \
  --spec-draft-device Vulkan0 \
  --spec-draft-ngl all \
  --spec-draft-type-k f16 \
  --spec-draft-type-v f16 \
  --spec-draft-n-max 5 \
  --spec-draft-n-min 0 \
  --spec-draft-p-min 0.0 \
  --spec-draft-p-split 0.10 \
  --chat-template-file chat_template.jinja \
  --reasoning on \
  --reasoning-format deepseek \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --mmproj mmproj-F32.gguf \
  --image-min-tokens 1024

The last two lines enable vision — the bundled mmproj-F32.gguf is the Qwen3-VL projector for this MoE (projection_dim 2048); omit them for text-only. --image-min-tokens 1024 is required whenever --mmproj is set.

Flag	Function
`HSA_OVERRIDE_GFX_VERSION=11.5.1`	treat the APU as gfx1151 (Strix Halo)
`GGML_HIP_ENABLE_UNIFIED_MEMORY=1`	allow use of the full 128 GB unified memory
`-dev Vulkan0`	run on Vulkan (KHR_coopmat) — beats ROCm here for ROCmFP4 on Strix Halo
`-ngl 999 · -fa on`	offload all layers · flash attention
`-c 262144`	context length (256K) — loads with ~92 GB free; KV footprint is modest
`-b 2048 · -ub 256 · -t/-tb 16`	prefill batch / micro-batch (256 = prefill optimum) · CPU threads
`-ctk f16 · -ctv f16`	f16 KV cache — how we run it; drop to `q8_0`/`q4_0` to use less memory
`-cpent · -ctxcp · --cache-reuse · --cache-ram 65536`	cross-turn KV checkpointing + 64 GB resident reuse cache
`--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0`	Qwen3.6 "precise coding" sampling (1.0 for general)
`--spec-type draft-mtp · --spec-draft-n-max 5`	built-in MTP head, self-speculative; draft depth 5
`--spec-draft-device Vulkan0 · -ngl all · type-k/v f16`	draft head on Vulkan, fully offloaded, f16 KV
`--chat-template-file chat_template.jinja`	froggeric unified Qwen3.6 template (tool calls + think-toggle)
`--reasoning on --reasoning-format deepseek + kwargs {preserve_thinking:true}`	keep `<think>` across turns with clean `content`+`reasoning_content`, so cross-turn cache survives
`--jinja --parallel 1 --metrics --no-mmap`	apply template · single slot · metrics · weights in RAM

Multi-turn prompt-cache reuse (OpenCode). Qwen3.6's recurrent state can't partial-rewind, so multi-turn reuse needs a context checkpoint. Two defaults otherwise force a full re-prefill every turn; both are fixed above:

Checkpoints — default -cpent is 8192, so prompts under 8K never checkpoint. Fix: -cpent 256 -ctxcp 32 --cache-reuse 256.
Thinking — --reasoning-format deepseek + --chat-template-kwargs '{"preserve_thinking": true}' keeps <think> across turns with clean content+reasoning_content. (none = raw tags inline but works with any content-echoing client; deepseek-legacy/auto do not reuse.)

--jinja is required for the chat template + preserve_thinking.

OpenAI-compatible client (e.g. OpenCode). In single-model mode llama-server ignores the request's model field, so the client's model name is just a label.

Base URL: http://<host>:8080/v1 · API key: any non-empty string (e.g. sk-local)
Model id this server reports: qwen35b-a3b-mtp

A patched OpenCode that compacts conversation history without invalidating the prompt cache is at PlunderStruck/opencode — pair it with the checkpoint flags to keep long sessions fast.

03 · PERFORMANCE & QUALITY

This is the best speed/quality balance in ROCmFP4 — by design, not the absolute fastest. It keeps the two quality levers that are actually felt — genuine f16 token embeddings and a Q6_K output head — on the fast single-scale body, with the F32 MoE router untouched. We tested the alternatives within rocmfp4 (an all-dual-scale body, selective higher-precision tensors); they cost decode speed for a KL improvement that sat inside the measurement noise, so the fast single-scale body + f16 embeddings + Q6 head is the right point. A leaner build (no Q6 head, or Q5 embeddings) is a few tok/s faster but degrades a quality lever you'll notice; we keep both.

WANT MAXIMUM FIDELITY INSTEAD OF SPEED? Unsloth's UD-Q4_K_XL (a dynamic K-quant) runs on this same fork, and MTP still works (it carries the nextn head) — at roughly ~2× lower KL divergence vs BF16, at a slower decode. We optimize for throughput in ROCmFP4; if you want the last bit of fidelity over speed, that's the one to grab.

How we landed on this recipe. We ran the full lever sweep on the 27B dense sibling — measuring every rocmfp4 build against the BF16 reference by KL divergence (the right fidelity metric) plus decode speed (llama-bench), and comparing to the best stock 4-bit. The finding generalizes here: an all-dual-scale body (COHERENT) and selective higher-precision bumps (DYN) both trade decode speed for a KL gain that sits inside the noise, while even copying Unsloth's entire high-precision allocation onto rocmfp4 still can't match a dynamic K-quant's fidelity — that's a format limit (rocmfp4's FP4 is intrinsically less faithful than Q4_K's 4-bit, a fidelity floor you can't out-allocate). So within rocmfp4 the fast body + f16 embeddings + Q6 head is the optimal balance (this file), and for maximum fidelity we link the dynamic K-quant rather than ship a worse copy. The numbered sweep — full experiments table, KLD numbers, and verdicts — is on the 27B card (those figures are 27B-specific; this 35B MoE follows the same frontier). (Directional internal measurements — reproduce before citing.)

Hands-on, on a Framework Desktop / AMD Ryzen AI Max+ 395 (gfx1151, 128 GB unified, ROCm 7.2.0):

DECODE	78–90 t/s (Vulkan / Strix Halo)
MTP DRAFT ACCEPTANCE	~0.6–0.95 (content-dependent)
CONTEXT @ LOAD	full 262144 with ~92 GB free
QUANTIZATION	non-imatrix · F32 MoE router

MoE decode is naturally fast — only ~3B params active per token — and the F32 router keeps expert selection clean. The router stays F32 for free: the quantizer excludes expert-gating tensors (ffn_gate_inp) from quantization, so routing — which experts each token goes to, a discrete, high-sensitivity decision — keeps full precision automatically, while the experts run on the custom ROCmFP4 kernel.

NOTE // f16 KV is how we run it (128 GB unified affords it; drop to q8_0/q4_0 to save memory). The Q6_K output head adds ~0.14 GB and a fixed per-token cost (a few % slower decode at short context, shrinking at long context) for a measurable fidelity gain — see the 27B card.

The companion 27B dense quant (same recipe) is at plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF.

04 · BUILD (REPRODUCIBLE)

Build the fork:

git clone https://github.com/charlie12345/ROCmFPX
cd ROCmFPX
env JOBS=16 scripts/build-strix-rocmfp4-mtp.sh

Quantize from the unsloth BF16+MTP GGUF — ROCmFP4 body, genuine f16 embeddings, no imatrix:

# the one build: STRIX preset + f16 embeddings + Q6_K output head
llama-quantize \
  --token-embedding-type f16 \
  --output-tensor-type q6_K \
  Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf \
  Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16-headQ6.gguf \
  Q4_0_ROCMFP4_STRIX

Architecture (qwen35moe): 41 blocks, 2048 hidden, 256 experts, with the nextn_predict_layers=1 MTP head (blk.40.nextn.*) — so self-speculative draft-MTP survives quantization. Format: ROCmFP4 is a 4-bit weight format for AMD using an FP4-derived value codebook plus one (FAST) or two (dual) UE4M3/FP8 scale bytes per 32-weight block; tensor-aware. This build (STRIX-embF16-headQ6): quality-biased STRIX preset + f16 token embeddings (full precision; a lookup, so ~zero decode cost) + a Q6_K output head. Experts (ffn_*_exps) run q4_0_rocmfp4_fast; attention K/V (+ fused QKV) run q4_0_rocmfp4 (dual-scale).

Experimental research build for AMD Strix Halo — hardware-, driver-, model-, and prompt-sensitive, may not reproduce on other GPUs. Not native FP4 tensor-core execution. Do not treat these numbers as upstream llama.cpp claims. Base BF16 GGUF pinned at revision 5bc3e238d916f48a861bac2f8a1990a0e9b7e98d.

05 · LINEAGE & CREDITS

BASE MODEL	Qwen3.6-35B-A3B (Qwen team) — derivative quantization that inherits the base model's license; verify the original Qwen3.6 terms before redistribution / use
BF16 GGUF SOURCE	unsloth/Qwen3.6-35B-A3B-MTP-GGUF @ `5bc3e238d916f48a861bac2f8a1990a0e9b7e98d`
FORMAT + RUNTIME	charlie12345/ROCmFPX (based on llama.cpp, MIT)
CHAT TEMPLATE	froggeric/Qwen-Fixed-Chat-Templates

Derivative quantization — verify the base model's license before redistribution / use.

Downloads last month: 2,285

GGUF

Model size

0.4B params

Architecture

clip

Hardware compatibility

16-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF

Base model

Qwen/Qwen3.6-35B-A3B

Quantized

unsloth/Qwen3.6-35B-A3B-MTP-GGUF

Quantized

(7)

this model

Collection including plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF

ROCmFP4 MTP · Strix Halo

Collection

Self-speculative MTP quants in custom ROCmFP4 4-bit for AMD Strix Halo (gfx1151). Needs the charlie12345/rocmfp4-llama fork. • 5 items • Updated Jun 14 • 3

FORMAT ROCmFP4 4-BIT	PRECISION 4.44 BPW	SIZE 18.8 GB	CONTEXT 262 K
ARCH MoE · 256 EXPERTS	ACTIVE / HIDDEN ~3B · 2048	DRAFT MTP n-max 5	BACKEND VULKAN0