Instructions to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF", filename="Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: ./llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16 # Run inference directly in the terminal: ./build/bin/llama-cli -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Use Docker
docker model run hf.co/plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
- LM Studio
- Jan
- Ollama
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Ollama:
ollama run hf.co/plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
- Unsloth Studio
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF to start chatting
- Pi
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Docker Model Runner:
docker model run hf.co/plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
- Lemonade
How to use plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF:BF16
Run and chat with the model
lemonade run user.Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF-BF16
List all available models
lemonade list
llm.create_chat_completion(
messages = "No input example has been defined for this model task."
)Qwen3.6-35B-A3B-MTP — ROCmFP4 STRIX (f16 embeddings)
Experimental AMD Strix Halo (gfx1151) quant of Qwen3.6-35B-A3B — a mixture-of-experts model (~3B active params) with the built-in MTP / next-token-prediction head — in the custom ROCmFP4 4-bit format. Tuned for high MTP draft acceptance and long-context, multi-turn use.
Requires the ROCmFP4 fork (public) — not stock llama.cpp
This file uses the ROCmFP4 tensor types (
q4_0_rocmfp4,q4_0_rocmfp4_fast). Stock llama.cpp, LM Studio, Ollama, Jan, koboldcpp, etc. cannot load it. Build and run it with the public forkcharlie12345/rocmfp4-llama:git clone https://github.com/charlie12345/rocmfp4-llama cd rocmfp4-llama && git checkout mtp-rocmfp4-strix env JOBS=16 scripts/build-strix-rocmfp4-mtp.sh
Part 1 — The model
What this is
- Base:
unsloth/Qwen3.6-35B-A3B-MTP-GGUFBF16, pinned at revision5bc3e238d916f48a861bac2f8a1990a0e9b7e98d. Archqwen35moe: 41 blocks, 2048 hidden, 256 experts, with thenextn_predict_layers=1MTP head (blk.40.nextn.*), so self-speculative draft-MTP survives quantization. - Format: ROCmFP4 — a 4-bit weight format for AMD using an FP4-derived value codebook plus one (FAST) or two (dual) UE4M3/FP8 scale bytes per 32-weight block. Tensor-aware.
- This variant (
STRIX-embF16): quality-biased STRIX preset + f16 token embeddings (full precision; it's a lookup, so ~zero decode cost). No imatrix.
| value | |
|---|---|
| File | Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf |
| Size / bpw | 18.8 GB / 4.44 bpw |
| token_embd | F16 |
MoE router (ffn_gate_inp) |
F32 (full precision, kept automatically) |
experts (ffn_*_exps) |
q4_0_rocmfp4_fast (custom kernel) |
| attention K/V (+ fused QKV) | q4_0_rocmfp4 (dual-scale) |
| MTP head | preserved (blk.40.nextn.*) |
The router stays F32 for free. The quantizer excludes expert-gating tensors from quantization, so routing (which experts each token goes to — a discrete, high-sensitivity decision) keeps full precision automatically, while the experts run on the custom ROCmFP4 kernel.
How it was built (reproducible)
llama-quantize \
--token-embedding-type f16 \
Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf \
Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf \
Q4_0_ROCMFP4_STRIX
Status & caveats
Experimental research build. Results are hardware-, driver-, model-, and prompt-sensitive, and tuned for AMD Strix Halo — they may not reproduce on other GPUs. This is not native FP4 tensor-core execution. Do not treat these numbers as upstream llama.cpp claims.
Credits & license
- Base model: Qwen3.6-35B-A3B (Qwen team) — a derivative quantization that inherits the base model's license; verify the original Qwen3.6 terms before redistribution/use.
- BF16 GGUF source:
unsloth/Qwen3.6-35B-A3B-MTP-GGUF@5bc3e238d916f48a861bac2f8a1990a0e9b7e98d. - ROCmFP4 format & runtime:
charlie12345/rocmfp4-llama(based on llama.cpp, MIT).
Part 2 — Making practical use of it
What I observed
Hands-on, on a Framework Desktop / AMD Ryzen AI Max+ 395 (gfx1151, 128 GB unified, ROCm 7.2.0):
- It runs great on the ROCmFP4 fork — 78–90 t/s decode, MTP draft acceptance ~0.6–0.95 (content-dependent), coherent, and it loads at full 262144 context with ~92 GB free (the KV footprint is modest). MoE decode is naturally fast (only ~3B params active per token), and the F32 router keeps expert selection clean.
- The companion 27B dense quant (same recipe) is at
plunderstruck/Qwen3.6-27B-MTP-ROCmFP4-GGUF.
Run config (highest MTP acceptance on Strix Halo)
Full-precision (f16) KV is the dominant acceptance lever; 128 GB unified affords it (drop to
-ctk q8_0 -ctv q8_0 on less memory).
env HSA_OVERRIDE_GFX_VERSION=11.5.1 GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
llama-server -m Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf \
--alias qwen3.6-35b-a3b-rocmfp4-mtp --host 0.0.0.0 --port 8080 \
-dev Vulkan0 -ngl 999 -fa on \
-c 262144 -b 2048 -ub 256 -t 16 -tb 16 \
-ctk f16 -ctv f16 \
-cpent 256 -ctxcp 32 --cache-reuse 256 \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
--presence-penalty 0.0 --repeat-penalty 1.0 \
--spec-type draft-mtp --spec-draft-device Vulkan0 --spec-draft-ngl all \
--spec-draft-type-k f16 --spec-draft-type-v f16 \
--spec-draft-n-max 3 --spec-draft-n-min 0 --spec-draft-p-min 0.0 --spec-draft-p-split 0.10 \
--reasoning on --reasoning-format deepseek \
--chat-template-kwargs '{"preserve_thinking": true}' \
--jinja --parallel 1 --metrics --no-mmap
-dev Vulkan0 (KHR_coopmat) beats ROCm here; -ub 256 is the prefill optimum; --spec-type draft-mtp uses the model's built-in MTP head. Temp 0.6 = Qwen3.6 "precise coding" (1.0 for general).
Multi-turn prompt-cache reuse (OpenCode)
Qwen3.6's recurrent state can't partial-rewind, so multi-turn reuse needs a context checkpoint. Two defaults otherwise force a full re-prefill every turn; both are fixed above:
- Checkpoints — default
-cpentis 8192, so prompts under 8K never checkpoint. Fix:-cpent 256 -ctxcp 32 --cache-reuse 256. - Thinking —
--reasoning-format deepseek+--chat-template-kwargs '{"preserve_thinking": true}'keeps<think>across turns with cleancontent+reasoning_content. (none= raw tags inline but works with any content-echoing client;deepseek-legacy/autodo not reuse.) - Vision —
--mmprojdisables cache reuse; keep it off for text/code.
--jinja is required for the chat template + preserve_thinking.
OpenAI-compatible client (e.g. OpenCode)
Point the client at the server. In single-model mode llama-server ignores the request's
model field, so the client's model name is just a label.
- Base URL:
http://<host>:8080/v1· API key: any non-empty string (e.g.sk-local) - Model id (what this server reports):
qwen3.6-35b-a3b-rocmfp4-mtp
A patched OpenCode that compacts conversation history without invalidating the prompt cache is
at PlunderStruck/opencode — pair it with the
checkpoint flags to keep long sessions fast.
- Downloads last month
- 195
16-bit
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="plunderstruck/Qwen3.6-35B-A3B-MTP-ROCmFP4-GGUF", filename="Qwen3.6-35B-A3B-MTP-ROCmFP4-STRIX-embF16.gguf", )