Qwen3.6 Claude Coder — local MoE coding agent (llama.cpp build)
A custom configuration of Qwen3.6-35B-A3B (Mixture-of-Experts, ~3B active parameters), set up to act as an autonomous coding agent: it uses tools instead of guessing, grounds every answer in the actual tool output (never fabricates results), does not loop on the same tool, and returns complete, runnable code. No-think mode is wired into the system prompt for fast, direct answers. Safety guardrails of the base model are intact.
It drives Claude Code, Codex and opencode fully locally — your code never leaves your machine and cloud token cost drops to zero.
This is the llama.cpp / ik_llama.cpp build. It has the same behavior and configuration as rafw007/qwen36-a3b-claude-coder on Ollama, packaged so that it loads on stock llama.cpp. See "Why a separate version" below.
Why a separate version (vs. the Ollama one)
The Ollama model and this one share the same agent config (system prompt + sampling params). What differs is packaging and the loader they target:
| | Ollama version | This llama.cpp version |
|---|---|---|
| Runtime | Ollama engine + Modelfile (RENDERER/PARSER qwen3.5) | stock llama.cpp / ik_llama.cpp (llama-server) |
| Weights | nvfp4 (~21 GB) | GGUF Q4_K_M (~24 GB) |
| Tool format | Ollama's native Qwen parser | GGUF Jinja chat template + --jinja |
| Agent config | baked into the Modelfile | supplied via launch flags + a system-prompt file (below) |
The actual fix. Qwen3.5/3.6-MoE uses multimodal RoPE (mRoPE), whose native rope.dimension_sections holds 3 ints [t, h, w]. Ollama's loader is lenient and accepts that. Recent stock llama.cpp (the Qwen3.5 loader from PR #19435) validates that key as a length-4 array and rejects the 3-element one:
key qwen35moe.rope.dimension_sections has wrong array length; expected 4, got 3
This is a known, family-wide converter/loader mismatch, not specific to this quant. In this GGUF the section array is padded to length 4 ([11, 11, 10] → [11, 11, 10, 0]); the 4th slot is the unused text section and does not change inference. The file therefore loads cleanly on current llama.cpp and ik_llama.cpp. If you hit the error above with any other Qwen3.5/3.6-MoE GGUF, this is the cause.
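The padding step itself is small enough to sketch in plain Python. This only illustrates the transformation applied to the metadata value; it does not read or rewrite a GGUF file, and the helper name `pad_rope_sections` is made up for this example rather than taken from any converter.

```python
def pad_rope_sections(sections, target_len=4, fill=0):
    """Return a copy of `sections` right-padded with `fill` up to `target_len`.

    Hypothetical helper for illustration: it mirrors the
    [11, 11, 10] -> [11, 11, 10, 0] change described above.
    """
    sections = list(sections)
    if len(sections) > target_len:
        raise ValueError(
            f"expected at most {target_len} sections, got {len(sections)}"
        )
    # Appending zeros leaves the existing t/h/w sections untouched.
    return sections + [fill] * (target_len - len(sections))


native = [11, 11, 10]            # 3-element value rejected by the loader
padded = pad_rope_sections(native)
print(padded)
```

An array that already has four entries is returned unchanged, so the helper is safe to apply twice.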
What it is (and what it is not)
Honest framing: the weights are stock Qwen3.6-35B-A3B. The "Claude Coder" behavior comes entirely from an agentic system prompt + sampling configuration, plus the llama.cpp-compatibility rope fix described above. Everything here is measured, not marketing.
Quick start (llama.cpp / ik_llama.cpp)
llama-server \
-m qwen36-a3b-claude-coder-q4_K_M-llama.cpp.gguf \
--jinja --reasoning-budget 0 \
-c 65536 \
--temp 0.6 --top-k 20 --top-p 0.8 --repeat-penalty 1 --presence-penalty 0 \
--system-prompt-file qwen36-system.txt \
--host 0.0.0.0 --port 8080
--reasoning-budget 0 enforces no-think mode. --jinja enables native tool calling via the embedded Qwen chat template. qwen36-system.txt is your agent system-prompt file; the Ollama build uses the same configuration, but its contents are not published.
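Once the server is up, any OpenAI-style client can talk to it. The following is a minimal sketch using only the Python standard library, assuming the server listens on port 8080 as in the command above and exposes the usual `/v1/chat/completions` route. The `run_shell` tool definition and the `model` value are placeholders invented for this example, not part of the published configuration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumes --port 8080 from the launch command

# Hypothetical tool definition, for illustration only.
RUN_SHELL_TOOL = {
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}


def build_chat_request(user_text, tools=None):
    """Build an OpenAI-style chat payload; sampling is left to the server flags."""
    payload = {
        "model": "qwen36-a3b-claude-coder",  # placeholder name, an assumption
        "messages": [{"role": "user", "content": user_text}],
    }
    if tools:
        payload["tools"] = tools
    return payload


def post_chat(payload, base_url=BASE_URL, timeout=120):
    """POST the payload to the chat completions route and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def main():
    """Send one request; call this only while llama-server is running."""
    payload = build_chat_request("How much disk space is free?", [RUN_SHELL_TOOL])
    reply = post_chat(payload)
    print(json.dumps(reply["choices"][0]["message"], indent=2))
```

Sampling parameters are deliberately omitted from the payload so that the values passed on the command line apply.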
Tested
End-to-end under opencode against ik_llama.cpp (llama-server, port-bound, --jinja): the
model emitted real tool_calls, executed a real df -h, grounded its answer on the actual output
and exited cleanly (no tool loop). Loads without the rope error on ik_llama.cpp (mRoPE sections
reported as [11, 11, 10, 0]).
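For readers wiring up their own client instead of opencode, the tool round trip described above can be sketched as follows. It assumes the OpenAI-style response shape, where an assistant message carries `tool_calls` with JSON-encoded `arguments`. The function name `handle_tool_calls`, the `executors` mapping, and the repeat guard are illustrative assumptions, not part of the tested setup.

```python
import json


def handle_tool_calls(message, executors, seen=None):
    """Run each tool call in an assistant message and return 'tool' role messages.

    `executors` maps a tool name to a callable taking the decoded arguments
    as keyword arguments and returning a string.
    `seen` is a set of (name, raw-arguments) pairs used to refuse exact repeats,
    a client-side guard added here purely for illustration.
    """
    seen = set() if seen is None else seen
    results = []
    for call in message.get("tool_calls") or []:
        name = call["function"]["name"]
        raw_args = call["function"]["arguments"]  # JSON string in this format
        key = (name, raw_args)
        if key in seen:
            output = "error: identical tool call already executed"
        elif name not in executors:
            output = f"error: unknown tool {name!r}"
        else:
            seen.add(key)
            output = executors[name](**json.loads(raw_args))
        results.append(
            {"role": "tool", "tool_call_id": call["id"], "content": output}
        )
    return results


# Synthetic assistant message shaped like a tool-calling reply.
example_message = {
    "role": "assistant",
    "tool_calls": [
        {
            "id": "call_1",
            "type": "function",
            "function": {"name": "run_shell", "arguments": '{"command": "df -h"}'},
        }
    ],
}

# A stand-in executor that only echoes the command instead of running it.
fake_executors = {"run_shell": lambda command: f"(would run) {command}"}
tool_messages = handle_tool_calls(example_message, fake_executors)
```

The returned messages are appended to the conversation after the assistant message and sent back in the next request, so the model can ground its answer on the tool output.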
Context
- Configured for 64K (Claude Code's recommended minimum). Base Qwen3.6 natively supports 262K, so context can be raised on stronger hardware. On a CPU-only box lower it (e.g. 16–32K) to fit RAM.
Files
| File | Quant | Size | Notes |
|---|---|---|---|
| qwen36-a3b-claude-coder-q4_K_M-llama.cpp.gguf | Q4_K_M | ~24 GB | mRoPE dimension_sections padded to length 4 for stock llama.cpp / ik_llama.cpp. |
How it was made
Designed, built and tested with the help of Claude Opus — the system prompt, parameter choices and context configuration come from that work. The llama.cpp packaging (rope-section fix + launch recipe) was added after a user report that the Ollama-targeted GGUF would not load on stock llama.cpp.
License
Apache 2.0 (inherited from the base Qwen3.6).
Base model
Qwen/Qwen3.6-35B-A3B