Instructions to use plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF", filename="Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16-headQ6.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: llama-cli -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: llama-cli -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: ./llama-cli -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: ./build/bin/llama-cli -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16
Use Docker
docker model run hf.co/plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16
- LM Studio
- Jan
- Ollama
How to use plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF with Ollama:
ollama run hf.co/plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16
- Unsloth Studio
How to use plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF to start chatting
- Pi
How to use plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16
Run Hermes
hermes
- Docker Model Runner
How to use plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF with Docker Model Runner:
docker model run hf.co/plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16
- Lemonade
How to use plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF:BF16
Run and chat with the model
lemonade run user.Qwen3.6-27B-MTP-ROCmFP4-GGUF-BF16
List all available models
lemonade list
llm.create_chat_completion(
messages = "No input example has been defined for this model task."
)Qwen3.6-27B-MTP — ROCmFP4 STRIX (imatrix + f16 embeddings)
Experimental AMD Strix Halo (gfx1151) quant of Qwen3.6-27B (dense, with the built-in MTP / next-token-prediction head) in the custom ROCmFP4 4-bit format — tuned for high MTP draft acceptance and long-context, multi-turn coding use.
⚠️ Ignore HuggingFace's auto-detected quant badge ("F16" / 16-bit) — it's wrong. HF's parser only knows the standard GGUF quant types, so it can't read the custom ROCmFP4 format. It ends up "seeing" only the genuinely-f16 token embeddings and mislabels the whole file as 16-bit. These are ~4.8 bpw 4-bit ROCmFP4 files, not 16-bit. Pick a file by its name in the Files and versions tab (see the two-files table below).
Requires the ROCmFP4 fork (public) — not stock llama.cpp
This file uses the ROCmFP4 tensor types (
q4_0_rocmfp4,q4_0_rocmfp4_fast). Stock llama.cpp, LM Studio, Ollama, Jan, koboldcpp, etc. cannot load it. Build and run it with the public forkcharlie12345/rocmfp4-llama:git clone https://github.com/charlie12345/rocmfp4-llama cd rocmfp4-llama && git checkout mtp-rocmfp4-strix env JOBS=16 scripts/build-strix-rocmfp4-mtp.sh
Two files in this repo (pick your trade-off)
| File | size | output head | best for |
|---|---|---|---|
…-STRIX-imatrix-embF16.gguf |
16.5 GB | ROCmFP4 4-bit | fastest — the original daily driver |
…-STRIX-imatrix-embF16-headQ6.gguf |
16.9 GB | Q6_K | a notch more faithful — trades ~5–7% decode for it |
The two are identical except one tensor: same STRIX recipe, same f16 embeddings, same imatrix,
same MTP head — they differ only in the output head (output.weight). Most of this card
describes that shared recipe; the section right below is just about the one change.
The Q6-head variant — a step up (experimental)
The f16-embeddings note further down is the change I felt the most: full-precision token embeddings made the model follow instructions noticeably better. This variant does the same thing to the other end of the model — the output head that turns the final hidden state into the next-token choice — raising it from the 4-bit ROCmFP4 format to standard Q6_K, and leaving everything else untouched.
What I observed: a further step up in instruction-following — beyond what the f16 embeddings already gave. Subjectively it's more consistent at actually doing what it's told: reaching for the specific tool I asked for, and sticking to the rules/format of a task, more reliably than the f16-embeddings build alone. The embedding is the input side; the output head is the output side — sharpening both beats sharpening either.
How I checked it wasn't just a vibe. Two measurements, both on held-out text the model never trained or was calibrated on:
Perplexity — how well it predicts held-out text (lower is better). The Q6 head improved both code and prose, where the imatrix on its own only helped code:
Test set daily (4-bit head) Q6 head held-out code 1.8596 1.8550 held-out prose 5.7165 5.6761 KL divergence vs the original BF16 model — how closely its word-probabilities track the full-precision model it's a copy of (lower = more faithful). The Q6 head was closer to BF16 on every measure (mean ≈ 0.0369 → 0.0345, about 6% nearer the original). It still agrees with BF16's top word ~96% of the time either way — so the head mostly sharpens confidence on the same choice rather than flipping it, which is exactly what "follows the rules more consistently" feels like.
These are small but consistent gains — not night-and-day, but they move the right way across two different tests and two text types, which matches what I felt. Small internal checks, not formal benchmarks; reproduce before citing.
The cost. The Q6 head steps off the tuned 4-bit kernel for that one tensor, so decode is ~5–7% slower at short context (a couple tokens/sec on this hardware), and the gap shrinks at long context (the head is a fixed per-token cost that gets diluted as the KV cache grows). Size grows ~0.4 GB. For me the quality is worth it; if you want maximum speed, use the original file above.
Build it yourself — same as the daily driver, with one extra flag (--output-tensor-type q6_K):
llama-quantize \
--imatrix qwen3.6-27b-code.imatrix \
--token-embedding-type f16 \
--output-tensor-type q6_K \
Qwen3.6-27B-BF16-00001-of-00002.gguf \
Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16-headQ6.gguf \
Q4_0_ROCMFP4_STRIX
Part 1 — The model
What this is
- Base:
unsloth/Qwen3.6-27B-MTP-GGUFBF16, pinned at revision5cb35eb3dcbf52dbce5f87dbc64df6aaffadcace. It carries thenextn_predict_layers=1MTP head, so self-speculative draft-MTP survives quantization. - Format: ROCmFP4 — a 4-bit weight format for AMD using an FP4-derived value codebook plus
one (FAST) or two (dual) UE4M3/FP8 scale bytes per 32-weight block. Tensor-aware: sensitive
attention K/V on the dual-scale
q4_0_rocmfp4, the bulk (FFN, lm-head) on the faster single-scaleq4_0_rocmfp4_fast. - This variant (
STRIX-imatrix-embF16):- f16 token embeddings (full precision — it's a lookup, so ~zero decode cost).
- code-calibrated importance matrix (imatrix) applied to all 496 quantizable tensors.
| value | |
|---|---|
| File | Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16.gguf |
| Size / bpw | 16.5 GB / 4.82 bpw |
| token_embd | F16 |
| attention K/V (+ fused QKV) | q4_0_rocmfp4 (dual-scale) |
| FFN, lm-head, rest | q4_0_rocmfp4_fast |
| MTP head | preserved (blk.64.nextn.*) |
How it was built (reproducible)
Calibration corpus (code_calibration.txt): a concatenation of three files from the
froggeric/imatrix dataset —
groups_merged.txt + code.txt + technical.txt (~646 KB total) — code-heavy but diverse
enough to avoid domain overfitting. The resulting imatrix (qwen3.6-27b-code.imatrix, 339 chunks)
is included in this repo, so you can reproduce the quant exactly without recomputing it.
# 1) importance matrix
llama-imatrix -m Qwen3.6-27B-BF16-00001-of-00002.gguf \
-f code_calibration.txt -o qwen3.6-27b-code.imatrix \
-dev Vulkan0 -ngl 999 -fa on -c 512
# 2) quantize: quality-biased STRIX preset + f16 embeddings + imatrix
llama-quantize \
--imatrix qwen3.6-27b-code.imatrix \
--token-embedding-type f16 \
Qwen3.6-27B-BF16-00001-of-00002.gguf \
Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16.gguf \
Q4_0_ROCMFP4_STRIX
Quality (internal perplexity, directional only)
Held-out perplexity at n_ctx=512, vs the same quant without imatrix (embeddings f16 in both):
| Test set | no-imatrix | this (imatrix) |
|---|---|---|
| held-out code | 1.8631 | 1.8596 |
| held-out prose | 5.7109 | 5.7165 |
Tiny improvement on code (the calibration domain), neutral on prose — expected at this bit rate (at 4+ bpw the base quant is already close to the original, so imatrix is a polish, not a transformation). Small internal checks, not rigorous benchmarks; reproduce before citing.
Status & caveats
Experimental research build. Results are hardware-, driver-, model-, and prompt-sensitive, and tuned for AMD Strix Halo — they may not reproduce on other GPUs. This is not native FP4 tensor-core execution. Do not treat these numbers as upstream llama.cpp claims.
Credits & license
- Base model: Qwen3.6-27B (Qwen team) — a derivative quantization that inherits the base model's license; verify the original Qwen3.6 terms before redistribution/use.
- BF16 GGUF source:
unsloth/Qwen3.6-27B-MTP-GGUF@5cb35eb3dcbf52dbce5f87dbc64df6aaffadcace. - ROCmFP4 format & runtime:
charlie12345/rocmfp4-llama(based on llama.cpp, MIT).
Part 2 — Making practical use of it
What I observed (the direction here)
These are hands-on observations from daily use on a Framework Desktop / AMD Ryzen AI Max+ 395 (gfx1151, 128 GB unified, ROCm 7.2.0) — not benchmarks, but the direction I was exploring:
- Raising the token-embedding layer to full precision (f16) made the model follow instructions noticeably better. It was the single change I felt the most — the embedding is the foundation every layer builds on, and the model has a very large vocab, so a faithful embedding pays off. It costs almost nothing on speed because the embedding is a lookup, not a matmul.
- The code-calibrated imatrix is a free polish on top (same size and speed) — small, but in the right direction on code.
- It's fast and genuinely usable day-to-day: MTP self-speculative decoding with full-precision KV gives ~0.87–0.90 draft acceptance, and it holds up at long context.
- It pairs especially well with my OpenCode fork (below), which keeps the prompt cache intact across history compaction — so long coding sessions don't re-prefill every turn.
Run config (highest MTP acceptance on Strix Halo)
Full-precision (f16) KV is the dominant acceptance lever here — it raised draft acceptance to
~0.87–0.90 warm (vs ~0.70–0.76 with q8/q4 KV). 128 GB unified RAM affords it; on less memory drop
to -ctk q8_0 -ctv q8_0 (lower acceptance).
env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server -m Qwen3.6-27B-MTP-ROCmFP4-STRIX-imatrix-embF16.gguf \
--alias qwen3.6-27b-rocmfp4-mtp --host 0.0.0.0 --port 8080 \
-dev Vulkan0 -ngl 999 -fa on \
-c 262144 -b 2048 -ub 256 -t 16 -tb 16 \
-ctk f16 -ctv f16 \
-cpent 256 -ctxcp 32 --cache-reuse 256 \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
--presence-penalty 0.0 --repeat-penalty 1.0 \
--spec-type draft-mtp --spec-draft-device Vulkan0 --spec-draft-ngl all \
--spec-draft-type-k f16 --spec-draft-type-v f16 \
--spec-draft-n-max 3 --spec-draft-n-min 0 --spec-draft-p-min 0.0 --spec-draft-p-split 0.10 \
--reasoning on --reasoning-format deepseek \
--chat-template-kwargs '{"preserve_thinking": true}' \
--jinja --parallel 1 --metrics --no-mmap
| Flag | Why |
|---|---|
-dev Vulkan0 |
Vulkan (KHR_coopmat) beats ROCm/HIP here — ~+1.7× prefill |
-ub 256 |
prefill optimum on this APU; bigger ubatch is slower |
-ctk f16 -ctv f16 |
full-precision main KV — the dominant MTP-acceptance lever |
--spec-type draft-mtp + f16 draft KV |
use the model's built-in MTP head; f16 draft KV keeps acceptance high |
--temp 0.6 ... |
Qwen3.6 "precise coding" sampling (temp 1.0 for general tasks) |
Decode (this hardware): ~33 t/s short context, ~18 t/s at ~140K. It's a hybrid SSM + attention model (48 SSM + 17 attention blocks), so only the attention layers grow a KV cache — it degrades gracefully at long context.
Multi-turn prompt-cache reuse (the part that makes it usable)
Qwen3.6's recurrent (SSM) state can't be partially rewound, so multi-turn reuse needs a context checkpoint at/before the divergence point. Two defaults otherwise force a full re-prefill every turn; both are fixed by flags above:
- Checkpoint cadence. Default
-cpentis 8192, so prompts under 8K never get a usable checkpoint. Fix:-cpent 256 -ctxcp 32 --cache-reuse 256(checkpoint every 256 tokens, keep 32, reuse a matching prefix of ≥256 tokens). Verified: a shared 3,000-token prefix re-prefill dropped 12.4 s → ~0.1 s. - Thinking text breaking the prefix match.
--reasoning-formatcontrols where<think>goes:deepseek(used here) → cleancontent+reasoning_content, auto-paired with--chat-template-kwargs '{"preserve_thinking": true}'so the Jinja template keeps<think>for all turns. Reuse holds if the client echoesreasoning_contentback — and with OpenCode the large stable leading context reuses via checkpoints regardless.none→ leaves<think>inline incontent, so any content-echoing client gets reuse (raw tags show inline).deepseek-legacy/autodo not reuse.
- Vision projector kills reuse. Loading
--mmprojdisables cache reuse entirely; keep vision off for text/code.
--jinja is required so the chat template (and preserve_thinking) apply.
OpenCode + my fork
Point OpenCode at the server as an OpenAI-compatible provider. In single-model mode
llama-server ignores the request's model field, so the client's model name is just a label
(it does not have to match --alias). The provider below is named lmstudio only because it
uses the generic OpenAI-compatible adapter — it points at this llama-server, not LM Studio.
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"lmstudio": {
"npm": "@ai-sdk/openai-compatible",
"name": "local llama-server (ROCmFP4)",
"options": { "baseURL": "http://<host>:8080/v1", "apiKey": "sk-local" },
"models": {
"qwen3.6-27b-mtp": {
"name": "Qwen 3.6 27B",
"limit": { "context": 262144, "output": 32768 }
}
}
}
},
"model": "lmstudio/qwen3.6-27b-mtp",
"compaction": { "auto": true, "reserved": 16384 }
}
Project-local opencode.json — disable the task tool so agents don't spawn subagents, keeping
the whole session on one cache-friendly context:
{
"$schema": "https://opencode.ai/config.json",
"agent": {
"build": { "tools": { "task": false } },
"plan": { "tools": { "task": false } }
}
}
The fork: PlunderStruck/opencode. compaction.auto
summarizes history when the context fills — which in stock OpenCode rewrites the leading prompt
and invalidates the cache, forcing a full re-prefill. This fork compacts without breaking the
cached prefix (plus a few other adjustments), so cache reuse survives compaction. Paired with the
checkpoint flags above, long sessions stay fast and actually usable.
- Downloads last month
- 371
16-bit
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF", filename="", )