Qwen3.6 Claude Coder — local MoE coding agent (llama.cpp build)

A custom configuration of Qwen3.6-35B-A3B (Mixture-of-Experts, ~3B active parameters), set up to act as an autonomous coding agent: it uses tools instead of guessing, grounds every answer in the actual tool output (never fabricates results), does not loop on the same tool, and returns complete, runnable code. No-think mode is wired into the system prompt for fast, direct answers. Safety guardrails of the base model are intact.

It drives Claude Code, Codex and opencode fully locally — your code never leaves your machine and cloud token cost drops to zero.

This is the llama.cpp / ik_llama.cpp build. Same behavior and configuration as rafw007/qwen36-a3b-claude-coder on Ollama — packaged so it loads on stock llama.cpp. See "Why a separate version" below.

Why a separate version (vs. the Ollama one)

The Ollama model and this one share the same agent config (system prompt + sampling params). What differs is packaging and the loader they target:

Ollama version This llama.cpp version
Runtime Ollama engine + Modelfile (RENDERER/PARSER qwen3.5) stock llama.cpp / ik_llama.cpp (llama-server)
Weights nvfp4 (~21 GB) GGUF Q4_K_M (~24 GB)
Tool format Ollama's native Qwen parser GGUF Jinja chat template + --jinja
Agent config baked into the Modelfile supplied via launch flags + a system-prompt file (below)

The actual fix. Qwen3.5/3.6-MoE uses multimodal RoPE (mRoPE) whose native rope.dimension_sections is 3 ints [t, h, w]. Ollama's loader is lenient and accepts that. Recent stock llama.cpp (the Qwen3.5 loader from PR #19435) validates that key as a length-4 array and rejects the 3-element one:

key qwen35moe.rope.dimension_sections has wrong array length; expected 4, got 3

This is a known, family-wide converter/loader mismatch — not specific to this quant. This GGUF has the section array padded to length 4 ([11, 11, 10] → [11, 11, 10, 0]; the 4th slot is the unused text section, it does not change inference), so it loads cleanly on current llama.cpp and ik_llama.cpp. If you hit the error above with any other Qwen3.5/3.6-MoE GGUF, this is the cause.

What it is (and what it is not)

Honest framing: the weights are stock Qwen3.6-35B-A3B. The "Claude Coder" behavior comes entirely from an agentic system prompt + sampling configuration, plus the llama.cpp-compatibility rope fix described above. Everything here is measured, not marketing.

Quick start (llama.cpp / ik_llama.cpp)

llama-server \
  -m qwen36-a3b-claude-coder-q4_K_M-llama.cpp.gguf \
  --jinja --reasoning-budget 0 \
  -c 65536 \
  --temp 0.6 --top-k 20 --top-p 0.8 --repeat-penalty 1 --presence-penalty 0 \
  --system-prompt-file qwen36-system.txt \
  --host 0.0.0.0 --port 8080

--reasoning-budget 0 enforces no-think. --jinja enables native tool-calling via the embedded Qwen chat template. qwen36-system.txt is your agent system-prompt file (same configuration as the Ollama build — its contents are not published).

Tested

End-to-end under opencode against ik_llama.cpp (llama-server, port-bound, --jinja): the model emitted real tool_calls, executed a real df -h, grounded its answer on the actual output and exited cleanly (no tool loop). Loads without the rope error on ik_llama.cpp (mRoPE sections reported as [11, 11, 10, 0]).

Context

  • Configured for 64K (Claude Code's recommended minimum). Base Qwen3.6 natively supports 262K, so context can be raised on stronger hardware. On a CPU-only box lower it (e.g. 16–32K) to fit RAM.

Files

File Quant Size Notes
qwen36-a3b-claude-coder-q4_K_M-llama.cpp.gguf Q4_K_M ~24 GB mRoPE dimension_sections padded to length-4 for stock llama.cpp / ik_llama.cpp.

How it was made

Designed, built and tested with the help of Claude Opus — the system prompt, parameter choices and context configuration come from that work. The llama.cpp packaging (rope-section fix + launch recipe) was added after a user report that the Ollama-targeted GGUF would not load on stock llama.cpp.

License

Apache 2.0 (inherited from the base Qwen3.6).

Downloads last month
70
GGUF
Model size
36B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for rafw007/qwen36-a3b-claude-coder-llama.cpp-GGUF

Quantized
(452)
this model