Feb 19: Qwen3-Coder-Next GGUFs update - much better outputs!
llama.cpp has fixed a bug which caused the model to loop and produce poor outputs. The calculation for vectorized key_gdiff has been corrected.
Thanks to the work of llama.cpp and contributors, we have now reconverted and re-uploaded the model.
- Feb 19 update: Tool-calling should now be even better after llama.cpp fixes parsing.
- Quantization benchmarks: See third-party Aider, LiveCodeBench v6, MMLU Pro, GPQA benchmarks for GGUFs here.
Please re-download the model and update llama.cpp. Thanks!
All have now been updated.
See file history for last updated ones.
Please let us know if you see an improvement!
Q8, MXFP4, and F16 were not updated; however, you must still update llama.cpp.
We also made a new tutorial on running our dynamic FP8 quant and have a new MXFP4 GGUF.
Guide: https://unsloth.ai/docs/models/qwen3-coder-next

Neither MXFP4 nor Q8 variants have been updated, is this intended or should we expect an update for those quants as well? Thanks for your hard work!
Looks marvelous...
Any plans to roll out a REAP version?
Your https://huggingface.co/unsloth/GLM-4.7-Flash-REAP-23B-A3B-GGUF has fantastic results.
I'm getting a lot of '"filePath"/home/' invalid json syntax in tool calls (with Q6_K_XL in opencode), and looping instead of fixing it (even when told to fix it, annoyingly). Is this why? Will pulling the files down again fix that?
> Neither MXFP4 nor Q8 variants have been updated, is this intended or should we expect an update for those quants as well? Thanks for your hard work!
Those quants do not use imatrix, so they are fine to be used as-is. Only quants using imatrixes needed to be requantized.
Those are not imatrix quants, so it's not needed.
> Those quants do not use imatrix, so they are fine to be used as-is. Only quants using imatrixes needed to be requantized.
But Q2-Q6 are re-uploaded too. They don't use imatrix I think. I don't understand.
They are imatrix. The only ones that aren't are 8-bit and above, and MXFP4.
Sorry, I thought only models with an initial "I" use imatrix. Are there any docs to better understand quant model naming and imatrix?
It seems that IQ quants require imatrix and Q quants don't, but you can still use imatrix to improve the accuracy of Q quants.
> I'm getting a lot of '"filePath"/home/' invalid json syntax in tool calls (with Q6_K_XL in opencode), and looping instead of fixing it (even when told to fix it, annoyingly). Is this why? Will pulling the files down again fix that?
Same for me with the latest version of opencode (1.1.50) and llama.cpp 7941.
opencode crashes with this model. Error message: JSON Parse error: Unrecognized token '/'
This problem is specific to Qwen3-Coder-Next, because I don't have it with other models.
Edit:
Tool calls fail with opencode, discussion here: https://www.reddit.com/r/LocalLLaMA/comments/1qvacqo/does_qwen3codernext_work_in_opencode_currently_or/
Edit (bis):
I've changed my opencode configuration to specify the tool_call and reasoning options, and that seems to fix the problem:
"qwen3-coder-next": {
  "name": "qwen3-coder-next (local)",
  "tool_call": true,
  "reasoning": true,
  "limit": {
    "context": 136608,
    "output": 25536
  }
}
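For anyone who wants to reproduce the parse failure reported above outside opencode, here is a minimal sketch. It assumes the tool-call arguments arrive as a raw JSON string; the helper name and the example path are hypothetical and only mimic the '"filePath"/home/' pattern from the report (a key followed by a value with no colon and no opening quote).

```python
import json


def check_tool_arguments(raw):
    """Return (parsed, None) if raw is valid JSON, else (None, error text)."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        # exc.msg and exc.pos describe what the parser expected and where.
        return None, f"{exc.msg} at position {exc.pos}"


# Well-formed arguments (hypothetical path).
good = '{"filePath": "/home/user/project/main.py"}'
# Malformed arguments imitating the reported pattern: the colon and the
# opening quote of the value are missing after the key.
bad = '{"filePath"/home/user/project/main.py"}'

parsed_good, err_good = check_tool_arguments(good)
parsed_bad, err_bad = check_tool_arguments(bad)

print("good:", parsed_good, err_good)
print("bad: ", parsed_bad, err_bad)
```

A check like this only detects the malformed call; it does not explain why the model emits it, which according to the thread depends on the llama.cpp parsing fix and the client configuration.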
Latest Q6_K_XL GGUF no longer detects the parameters like the architecture or context length in LM Studio (0.4.1). Previous upload was able to detect these without issue.
MXFP4 GGUF is a little better, but incorrectly lists the model as "512x2.5B" whereas Qwen3-Next flavors (not the coder release) are displayed as "80B-A3B" in LM Studio.
I have this problem too: the models are unrecognized in LM Studio. The previous uploads failed in agentic mode via Continue; right now MXFP4 works well after the update.
This is the mainline llama.cpp PR in question for those following along at home: https://github.com/ggml-org/llama.cpp/pull/19324
For me it works perfectly with the latest llama. Actually, this is the first model that I don't want to reduce the temperature of; it is just perfect.
Will other qwen3next imatrix releases be reuploaded too?
Is it me, or is the 1st file incomplete?
Saving to: ‘Qwen3-Coder-Next-UD-Q8_K_XL-00001-of-00003.gguf’
Qwen3-Coder-Next-UD-Q8_K_XL-00001-of-00003.gg 100%[==============================================================================================>] 5.66M --.-KB/s in 0.05s
2026-02-06 15:26:48 (105 MB/s) - ‘Qwen3-Coder-Next-UD-Q8_K_XL-00001-of-00003.gguf’ saved [5936032/5936032]
My bad, I was using old file references from when it was 1/2.
I have updated llama.cpp and there is still an issue with the Q6_K_XL reupload. Other quants, like Q4_K_XL, seem good though.
The issue is that the architecture and context length cannot be detected. This makes it impossible to set a context length above the default (2048 tokens), and because the architecture is not recognized, the backend doesn't apply any qwen3next-specific handling, so the model's output is incoherent.
Again, this is just an issue with the newly reuploaded Q6_K_XL. I understand that the other quants are working well for everyone. The original upload was working fine before this. Please take a look at the newly uploaded Q6_K_XL.
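One way to check whether a downloaded file actually carries the architecture and context-length metadata is to read the GGUF header directly. The sketch below assumes a little-endian GGUF version 2 or 3 file and the usual key convention (`general.architecture`, then `<architecture>.context_length`). The function names are made up for illustration, and this is not how LM Studio or llama.cpp perform their detection.

```python
import struct

# GGUF metadata value type ids mapped to struct formats for fixed-size scalars.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-size little-endian value from the stream."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """Read a GGUF string: uint64 length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    data = f.read(length)
    if len(data) != length:
        raise ValueError("unexpected end of file")
    return data.decode("utf-8", errors="replace")


def _read_value(f, vtype):
    """Read one metadata value of the given type id."""
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        elem_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown metadata value type {vtype}")


def read_gguf_metadata(f):
    """Return (version, tensor_count, metadata dict) from a binary stream."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(f, "<I")
    tensor_count = _read(f, "<Q")
    kv_count = _read(f, "<Q")
    meta = {}
    for _ in range(kv_count):
        key = _read_string(f)
        meta[key] = _read_value(f, _read(f, "<I"))
    return version, tensor_count, meta


def describe(meta):
    """Return (architecture, context_length); either may be None if absent."""
    arch = meta.get("general.architecture")
    ctx = meta.get(f"{arch}.context_length") if arch else None
    return arch, ctx
```

To use it, open the first shard in binary mode, for example `with open(path, "rb") as f: print(describe(read_gguf_metadata(f)[2]))`, where `path` is wherever you saved the file. If either value comes back as `None`, the metadata is missing from that file; if both are present, the problem is more likely in how the frontend reads them.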
In LM Studio, Q6, Q6_K_XL, and Q8 are all not working. Only MXFP4 works.
Same, the UD quants seem to suffer from something (I tried the Q3_K_XL and Q4_K_XL ones). MXFP4 is fine for me too
Weights are updated again?
Honestly, providing a short commit message instead of the default "Upload folder using huggingface_hub" would save many people from wondering each time you update the weights. Providing no changelog feels a bit like experimenting without knowing what you are doing and relying only on user feedback. I know that's not the case, lol, so I don't understand the underlying reason.
And I mean... it's not like you are updating thousands of models each day. RUDE... Sorry for this...
Tried the Q8_K_XL quant in LM Studio, doesn't recognise it as MoE and can't select more than 2k context.
Sorry yes we did do an update, it only affects the smaller quants, but in smallish ways - some tensors are upcast a bit more, so it retains a bit more accuracy.
I can ask LM Studio about Q6_K_XL / Q8_K_XL, but my guess is that it doesn't like some of the upcast F16 layers.
Thanks, @danielhanchen. For what it's worth, while I agree that a clear changelog (and version tags) would be good, I really appreciate the updates.
Model releases do tend to be sloppily version controlled; to my mind they should be semantically versioned and tagged rigorously in version control, just like any other "software". That said, you ARE actually maintaining the models, like software. There are so many models on here that have invalid chat templates, bad weights, or some other issue (even @nvidia ;/ ), and people are just left to download tens of GB and realise that it doesn't work on their own. So, you're doing the right thing here by fixing the issues, of course :)
Yes, unsloth do a great job.
> They are imatrix. The only ones that aren't are 8-bit and above, and MXFP4.
> Sorry, I thought only models with an initial "I" use imatrix. Are there any docs to better understand quant model naming and imatrix?
> It seems that IQ quants require imatrix and Q quants don't, but you can still use imatrix to improve the accuracy of Q quants.
My understanding is that IQ quants are made with imatrix files, but IQ*.gguf files include the relevant scaling. So unless you're making (or debugging) iq-quants, you don't need the imatrix files.
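The rule of thumb given in the replies above (for this repo's Feb 19 re-upload, everything below 8-bit is an imatrix quant except MXFP4, and only imatrix quants were requantized) can be sketched as a small name check. This encodes what was said in this thread only; the function name is hypothetical, and it is not a general rule for GGUF naming elsewhere.

```python
import re


def reuploaded_on_feb_19(quant_name):
    """Guess from a quant name whether it was requantized, per this thread.

    Assumption from the replies above: MXFP4, F16/BF16 and 8-bit quants do
    not use imatrix here, while lower-bit Q and IQ quants do.
    """
    name = quant_name.upper()
    if "MXFP4" in name:
        return False
    if re.search(r"(?:BF|F)(?:16|32)", name):
        return False
    match = re.search(r"Q(\d+)", name)
    if match is None:
        raise ValueError(f"cannot find a bit width in {quant_name!r}")
    return int(match.group(1)) < 8


for quant in ["UD-Q4_K_M", "Q6_K_XL", "IQ4_XS", "Q8_0", "MXFP4", "BF16"]:
    print(quant, reuploaded_on_feb_19(quant))
```

The file history on the repo remains the authoritative source for which files actually changed.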
I'm using q3kxl and don't like it, tbh; the older version was better. Is there any way to get it back? Or is there not much difference between q3kxl and q3ks (which hasn't been updated, right?)?
I'm having the same experience. The UD quants I got on the 19th of February work fine for me. The new ones uploaded in early March are slower and produce worse outputs for me (I'm actually getting better results from the old 4QKXL than the new 8QKXL). I had the same problem with Qwen 3.5 35b.
For anyone wanting "old coke" like me ;) You can find the quants here
https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/tree/30261170f7aff4f3d4283dd1f09c8510432aead7
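To fetch a file from that older commit rather than from the current main branch, the revision can be pinned in the download URL. A minimal sketch, assuming the standard Hugging Face `resolve/<revision>/<path>` URL layout; the function name is hypothetical and the filename in the example is a placeholder, so check the tree linked above for the real file paths. If you use the huggingface_hub library instead, `hf_hub_download` accepts a `revision` argument for the same purpose.

```python
from urllib.parse import quote

REPO_ID = "unsloth/Qwen3-Coder-Next-GGUF"
# Commit hash taken from the link above.
OLD_REVISION = "30261170f7aff4f3d4283dd1f09c8510432aead7"


def pinned_file_url(repo_id, revision, filename):
    """Build a download URL for one file at a specific commit."""
    return (
        f"https://huggingface.co/{repo_id}/resolve/"
        f"{quote(revision, safe='')}/{quote(filename)}"
    )


# Placeholder filename for illustration only.
url = pinned_file_url(REPO_ID, OLD_REVISION, "path/to/your-quant.gguf")
print(url)
```

The resulting URL can be passed to wget or curl. For split quants, each shard needs its own URL at the same revision so that old and new shards are not mixed.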

