Instructions to use unsloth/Qwen3-Coder-Next-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use unsloth/Qwen3-Coder-Next-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="unsloth/Qwen3-Coder-Next-GGUF",
	filename="BF16/Qwen3-Coder-Next-BF16-00001-of-00004.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use unsloth/Qwen3-Coder-Next-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_M

Use Docker

docker model run hf.co/unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_M

LM Studio
Jan

vLLM

How to use unsloth/Qwen3-Coder-Next-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "unsloth/Qwen3-Coder-Next-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "unsloth/Qwen3-Coder-Next-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
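
The same request can be sketched in Python with only the standard library. This is a minimal illustration, assuming the server is reachable at the base URL used in the curl example and that it returns the usual OpenAI-compatible response shape; the helper names `build_chat_request` and `chat` are made up for this example.

```python
import json
import urllib.request


def build_chat_request(prompt, model="unsloth/Qwen3-Coder-Next-GGUF",
                       base_url="http://localhost:8000/v1"):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-compatible response: choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
#   print(chat("What is the capital of France?"))
```

For a llama.cpp server, passing `base_url="http://localhost:8080/v1"` should work the same way, since that is the port the Pi and Hermes configurations further down point at.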

Use Docker

docker model run hf.co/unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_M

Ollama
How to use unsloth/Qwen3-Coder-Next-GGUF with Ollama:
```
ollama run hf.co/unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_M
```

Unsloth Studio

How to use unsloth/Qwen3-Coder-Next-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for unsloth/Qwen3-Coder-Next-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for unsloth/Qwen3-Coder-Next-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for unsloth/Qwen3-Coder-Next-GGUF to start chatting

Pi

How to use unsloth/Qwen3-Coder-Next-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use unsloth/Qwen3-Coder-Next-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use unsloth/Qwen3-Coder-Next-GGUF with Docker Model Runner:
```
docker model run hf.co/unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_M
```

Lemonade

How to use unsloth/Qwen3-Coder-Next-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_M

Run and chat with the model

lemonade run user.Qwen3-Coder-Next-GGUF-UD-Q4_K_M

List all available models

lemonade list

Feb 19: Qwen3-Coder-Next GGUFs update - much better outputs!

pinned

by danielhanchen - opened Feb 4

Discussion

danielhanchen

Unsloth AI org Feb 4 • edited Feb 24

llama.cpp has fixed a bug which caused the model to loop and produce poor outputs. The calculation for vectorized key_gdiff has been corrected.
Thanks to the work of llama.cpp and its contributors, we have now reconverted and re-uploaded the model.

Feb 19 update: Tool-calling should now be even better after llama.cpp fixes parsing.
Quantization benchmarks: See third-party Aider, LiveCodeBench v6, MMLU Pro, GPQA benchmarks for GGUFs here.

Please re-download and update llama.cpp, thanks!

All have now been updated.

See file history for last updated ones.

Please let us know if you see an improvement!
Q8, MXFP4, and F16 are not updated; however, you must still update llama.cpp.

We also made a new tutorial on running our dynamic FP8 quant and have a new MXFP4 GGUF.

Guide: https://unsloth.ai/docs/models/qwen3-coder-next

qwen3-coder-next fixed

danielhanchen pinned discussion Feb 4

danielhanchen changed discussion title from Feb 4: Qwen3-Coder-Next GGUFs reuploaded - much better outputs! to Feb 4: Qwen3-Coder-Next GGUFs reuploaded - much better outputs! (Still in progress) Feb 4

Gallardo994

Feb 4

Neither MXFP4 nor Q8 variants have been updated, is this intended or should we expect an update for those quants as well? Thanks for your hard work!

Reverger

Feb 4

Looks marvelous...
Any plans to roll out a REAP version?
Your https://huggingface.co/unsloth/GLM-4.7-Flash-REAP-23B-A3B-GGUF has fantastic results.

deleted

Feb 4 • edited Feb 4

I'm getting a lot of '"filePath"/home/' invalid json syntax in tool calls (with Q6_K_XL in opencode), and looping instead of fixing it (even when told to fix it, annoyingly). Is this why? Will pulling the files down again fix that?

noctrex

Feb 5

> Neither MXFP4 nor Q8 variants have been updated, is this intended or should we expect an update for those quants as well? Thanks for your hard work!

Those quants do not use imatrix, so they are fine to be used as-is. Only quants using imatrixes needed to be requantized.

danielhanchen

Unsloth AI org Feb 5

> Neither MXFP4 nor Q8 variants have been updated, is this intended or should we expect an update for those quants as well? Thanks for your hard work!

Those are not imatrix quants, so re-uploading them isn't needed.

danielhanchen

Unsloth AI org Feb 5

> > Those quants do not use imatrix, so they are fine to be used as-is. Only quants using imatrixes needed to be requantized.
>
> But Q2-Q6 are re-uploaded too. They don't use imatrix I think. I don't understand.

They are imatrix. The only ones that aren't are 8-bit and above and MXFP4.

CHNtentes

Feb 5

> They are imatrix. The only ones that aren't are 8-bit and above and MXFP4.

Sorry, I thought only models with an initial "I" use imatrix. Are there any docs to better understand quant model naming and imatrix?

It seems that IQ quants require imatrix and Q quants don't, but you can still use imatrix to improve the accuracy of Q quants.

jibe77

Feb 5 • edited Feb 7

> I'm getting a lot of '"filePath"/home/' invalid json syntax in tool calls (with Q6_K_XL in opencode), and looping instead of fixing it (even when told to fix it, annoyingly). Is this why? Will pulling the files down again fix that?

Same for me with the latest version of opencode (1.1.50) and llama.cpp 7941.
It crashes with this model. Error message: JSON Parse error: Unrecognized token '/'
This problem is specific to Qwen3-Coder-Next; I don't see it with other models.

Edit:
Tool calls fail with opencode; discussion here: https://www.reddit.com/r/LocalLLaMA/comments/1qvacqo/does_qwen3codernext_work_in_opencode_currently_or/

Edit 2:
I changed my opencode configuration to specify the tool_call and reasoning options, and that seems to fix the problem:

"qwen3-coder-next": {
  "name": "qwen3-coder-next (local)",
  "tool_call": true,
  "reasoning": true,
  "limit": {
    "context": 136608,
    "output": 25536
  }
}
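
To make the reported failure concrete, here is a small Python sketch of why a tool call containing `"filePath"/home/` cannot be parsed. The `bad` string is a hypothetical reconstruction based on the fragment quoted above (the full malformed output was not posted), and the file path is made up.

```python
import json


def parse_tool_arguments(raw):
    """Return (arguments, None) on success or (None, error message) on failure."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        return None, f"{exc.msg} at column {exc.colno}"


# Well-formed arguments: the key is followed by a colon and a quoted string.
good = '{"filePath": "/home/user/project/main.py"}'

# Hypothetical malformed arguments: the colon and opening quote are missing,
# so the parser hits '/' where it expects ':'.
bad = '{"filePath"/home/user/project/main.py"}'

print(parse_tool_arguments(good))
print(parse_tool_arguments(bad))
```

A client that surfaces the parse error back to the model, rather than crashing, at least gives the model a chance to retry, although the thread suggests the model tended to loop instead.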

thaatz

Feb 5

Latest Q6_K_XL GGUF no longer detects the parameters like the architecture or context length in LM Studio (0.4.1). Previous upload was able to detect these without issue.
MXFP4 GGUF is a little better, but incorrectly lists the model as "512x2.5B" whereas Qwen3-Next flavors (not the coder release) are displayed as "80B-A3B" in LM Studio.

Luke2406

Feb 5

I have this problem too: the models are unrecognized in LM Studio. The previous models failed in agentic mode via Continue; right now, MXFP4 works well after the update.

ubergarm

Feb 5

This is the mainline llama.cpp PR in question for those following along at home: https://github.com/ggml-org/llama.cpp/pull/19324

puchuu

Feb 6

For me it works perfectly with the latest llama. Actually, this is the first model that I don't want to reduce the temperature of; it is just perfect.

L29Ah

Feb 6

Will other qwen3next imatrix releases be reuploaded too?

jcaneira

Feb 6

Is it me, or is the 1st file incomplete?

Saving to: ‘Qwen3-Coder-Next-UD-Q8_K_XL-00001-of-00003.gguf’

Qwen3-Coder-Next-UD-Q8_K_XL-00001-of-00003.gg 100%[==============================================================================================>] 5.66M --.-KB/s in 0.05s

2026-02-06 15:26:48 (105 MB/s) - ‘Qwen3-Coder-Next-UD-Q8_K_XL-00001-of-00003.gguf’ saved [5936032/5936032]

jcaneira

Feb 6

> Is it me, or is the 1st file incomplete?
>
> Saving to: ‘Qwen3-Coder-Next-UD-Q8_K_XL-00001-of-00003.gguf’
>
> Qwen3-Coder-Next-UD-Q8_K_XL-00001-of-00003.gg 100%[==============================================================================================>] 5.66M --.-KB/s in 0.05s
>
> 2026-02-06 15:26:48 (105 MB/s) - ‘Qwen3-Coder-Next-UD-Q8_K_XL-00001-of-00003.gguf’ saved [5936032/5936032]

My bad, I was using old file references from when it was 1/2.
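
When a downloaded shard looks suspiciously small, one quick local check is to read its GGUF header. The sketch below assumes the GGUF v2+ header layout (4-byte magic `GGUF`, then a little-endian uint32 version, uint64 tensor count, and uint64 metadata key/value count); the function name `read_gguf_header` is made up for this example. It only confirms that the file starts like a GGUF file; it does not prove the download is complete.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (kv count)


def read_gguf_header(path):
    """Read the fixed-size GGUF header and return its fields as a dict.

    Raises ValueError if the file is too short or lacks the GGUF magic.
    """
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not start with a valid GGUF header")
    # "<" = little-endian without padding, I = uint32, Q = uint64
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:HEADER_SIZE])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Usage (path is whatever file you downloaded):
#   print(read_gguf_header("Qwen3-Coder-Next-UD-Q8_K_XL-00001-of-00003.gguf"))
```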

danielhanchen changed discussion title from Feb 4: Qwen3-Coder-Next GGUFs reuploaded - much better outputs! (Still in progress) to Feb 4: Qwen3-Coder-Next GGUFs reuploaded - much better outputs! Feb 6

thaatz

Feb 7 • edited Feb 7

I have updated llama.cpp and there is still an issue with the Q6_K_XL reupload. Other quants, like Q4_K_XL, seem good though.

The issue is that the architecture and context length cannot be detected. This makes it impossible to set a context length above the default (2048 tokens), and because the architecture isn't recognized, the backend doesn't apply any qwen3next-specific handling, so the model's output is incoherent.

Again, this is just an issue with the newly reuploaded Q6_K_XL. I understand that the other quants are working well for everyone. The original upload was working fine before this. Please take a look at the newly uploaded Q6_K_XL.

tanjib12

Feb 8

Luke2406

Feb 8

In LM Studio, Q6, Q6_K_XL, and Q8 are all not working... Only MXFP4 works.

owao

Feb 9

Same, the UD quants seem to suffer from something (I tried the Q3_K_XL and Q4_K_XL ones). MXFP4 is fine for me too

engrtipusultan

Feb 13

Weights are updated again?

owao

Feb 13

Honestly, providing a short commit message instead of the default "Upload folder using huggingface_hub" would keep many people from wondering each time you update the weights... Because honestly, providing no changelog feels a bit like experimenting without knowing what you're doing and relying only on user feedback. But I know that's not the case lol, so I don't understand the underlying reason.

owao

Feb 13 • edited Feb 14

~~And I mean... It's not like you are updating thousands of models each days~~ RUDE... Sorry for this...

Cubes123

Feb 14

Tried the Q8_K_XL quant in LM Studio, doesn't recognise it as MoE and can't select more than 2k context.

danielhanchen

Unsloth AI org Feb 14

Sorry, yes, we did do an update. It only affects the smaller quants, and in smallish ways: some tensors are upcast a bit more, so they retain a bit more accuracy.

I can ask LM Studio about Q6_K_XL / Q8_K_XL, but my guess is that it doesn't like some of the upcast F16 layers.

deleted

Feb 14 • edited Feb 14

Thanks, @danielhanchen . For what it's worth, while I agree that a clear changelog (and version tags) would be good, I really appreciate the updates.

Model releases do tend to be sloppily version controlled; to my mind they should be semantically versioned and tagged rigorously in version control, just like any other "software". That said, you ARE actually maintaining the models, like software. There are so many models on here that have invalid chat templates, bad weights, or some other issue (even @nvidia ;/ ), and people are just left to download tens of GB and realise that it doesn't work on their own. So, you're doing the right thing here by fixing the issues, of course :)

Cubes123

Feb 14

Yes, unsloth do a great job.

deleted

Feb 14

> > They are imatrix. The only ones that aren't are 8-bit and above and MXFP4.
>
> Sorry, I thought only models with an initial "I" use imatrix. Are there any docs to better understand quant model naming and imatrix?
>
> It seems that IQ quants require imatrix and Q quants don't, but you can still use imatrix to improve the accuracy of Q quants.

My understanding is that IQ quants are made with imatrix files, but IQ*.gguf files include the relevant scaling. So unless you're making (or debugging) iq-quants, you don't need the imatrix files.
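
As a rough intuition for what an importance matrix does at quantization time, here is a toy Python sketch. It is not llama.cpp's actual algorithm: it quantizes one small block of weights with a single scale and picks that scale by minimizing a squared error weighted by per-weight importance. All values and function names are made up for illustration.

```python
def quantize(weights, scale, levels=7):
    """Round each weight to an integer in [-levels, levels], then rescale."""
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]


def weighted_error(weights, approx, importance):
    """Sum of squared errors, each multiplied by that weight's importance."""
    return sum(i * (w - a) ** 2 for w, a, i in zip(weights, approx, importance))


def best_scale(weights, importance, levels=7, steps=200):
    """Grid-search the block scale that minimizes the importance-weighted error."""
    max_abs = max(abs(w) for w in weights)
    candidates = [max_abs / levels * (0.5 + k / steps) for k in range(steps + 1)]
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize(weights, s, levels), importance),
    )


weights = [0.9, -0.31, 0.12, 0.05]      # toy block of weights
uniform = [1.0, 1.0, 1.0, 1.0]          # no importance data: every weight counts equally
importance = [0.1, 5.0, 5.0, 0.1]       # toy importance: the middle weights matter most

scale_plain = best_scale(weights, uniform)
scale_aware = best_scale(weights, importance)

print("error on important weights, plain scale:",
      weighted_error(weights, quantize(weights, scale_plain), importance))
print("error on important weights, aware scale:",
      weighted_error(weights, quantize(weights, scale_aware), importance))
```

In this toy setup the importance only influences which scale is chosen; the stored result is still just quantized values plus a scale, which is consistent with the point above that the imatrix file itself is not needed to run the resulting GGUF.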

danielhanchen changed discussion title from Feb 4: Qwen3-Coder-Next GGUFs reuploaded - much better outputs! to Feb 19: Qwen3-Coder-Next GGUFs update - much better outputs! Feb 24

abdy3ad

Mar 9 • edited Mar 9

I'm using q3kxl and don't like it tbh, the older version was better. Is there any way to get it back, or is there not much difference between q3kxl and q3ks (this one hasn't been updated, right)?

ubergarm

Mar 10

@abdy3ad

This reminds me of when "new coke" was released in 1985 and everyone went nuts for the "old coke":

If they didn't super squash the repo already you could look through the commit history and try to find the old ones.

Grossor

Mar 17

I'm having the same experience. The UD quants I got on the 19th of February work fine for me. The new ones uploaded in early March are slower and produce worse outputs for me (I'm actually getting better results from the old 4QKXL than the new 8QKXL). I had the same problem with Qwen 3.5 35b.

For anyone wanting "old coke" like me ;) You can find the quants here
https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/tree/30261170f7aff4f3d4283dd1f09c8510432aead7
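
For scripted downloads, the commit hash in that link can be used to pin a specific revision. A minimal Python sketch follows, assuming the standard Hugging Face `resolve` URL pattern; the filename in the usage comment is a placeholder, so check the file listing at that revision for the real names. The `huggingface_hub` function `hf_hub_download` also accepts a `revision` argument for the same purpose.

```python
from urllib.parse import quote

REPO_ID = "unsloth/Qwen3-Coder-Next-GGUF"
OLD_REVISION = "30261170f7aff4f3d4283dd1f09c8510432aead7"  # commit from the link above


def resolve_url(repo_id, revision, filename):
    """Build a direct download URL for one file at a pinned revision."""
    return (
        f"https://huggingface.co/{repo_id}/resolve/"
        f"{quote(revision, safe='')}/{quote(filename)}"
    )


# Usage (placeholder filename; replace it with a file that exists at that revision):
#   print(resolve_url(REPO_ID, OLD_REVISION, "path/to/some-quant.gguf"))
```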
