Quant requests?

#1
by ubergarm - opened

As qwen3next architecture support only landed in ik_llama.cpp fairly recently (along with qwen35moe), I haven't done as many quants here for Qwen3-Coder-Next.

If there is a target size you're interested in, e.g. full offload on a 32GB 5090, let me know. This model seems pretty fast with long context, which is nice for a zippy local vibe-coding experience for simpler code changes and workloads.


I've not been able to get below ~2.3 bpw or so, because quantizing with an imatrix gives an error. It seems to handle ffn_down_exps fine but explodes on ffn_gate_exps when an imatrix is provided, and without an imatrix it throws "The result will be garbage, so bailing out". I could probably override that and force it to work without an imatrix, or go back and run a much larger corpus to collect more importance data, as these Qwen models have been difficult for imatrix like this before. I've tested iq1_kt, iq2_kt, and iq1_s, which all fail similarly.

```
[   9/ 843]                blk.0.ssm_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[  10/ 843]                 blk.0.ssm_out.weight - [ 4096,  2048,     1,     1], type =   bf16, Using custom type q8_0 for tensor blk.0.ssm_out.weight converting to q8_0 .. size =    16.00 MiB ->     8.50 MiB
[  11/ 843]           blk.0.ffn_down_exps.weight - [  512,  2048,   512,     1], type =   bf16, Using custom type iq1_kt for tensor blk.0.ffn_down_exps.weight converting to iq1_kt .. size =  1024.00 MiB ->   116.00 MiB
[  12/ 843]           blk.0.ffn_gate_exps.weight - [ 2048,   512,   512,     1], type =   bf16, Using custom type iq1_kt for tensor blk.0.ffn_gate_exps.weight converting to iq1_kt .. Oops: jbest = -1 for cluster 98 with 1160 points
/home/w/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:8334: GGML_ASSERT(false) failed
Oops: jbest = -1 for cluster 162 with 1224 points
/home/w/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:8334: GGML_ASSERT(false) failed
Oops: jbest = -1 for cluster 157 with 872 points
```
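For anyone curious about that `jbest = -1` assert: the kt-style quants pick a best codebook cluster for each group of weights via a weighted scoring search, and if no candidate ever produces a positive score (e.g. when the importance weights for a group are all zero), no best index gets assigned. This is only an illustrative sketch of that failure mode in Python, not ik_llama.cpp's actual code; the function and variable names are made up:

```python
# Illustrative sketch (NOT ik_llama.cpp code): a weighted best-cluster
# search that can leave jbest = -1 when all importance weights are zero.

def find_best_cluster(point, clusters, weights):
    jbest = -1
    best_score = 0.0  # candidate must score strictly above this to be accepted
    for j, center in enumerate(clusters):
        # importance-weighted correlation; with all-zero weights every
        # candidate scores 0, so no cluster ever beats best_score
        score = sum(w * p * c for w, p, c in zip(weights, point, center))
        if score > best_score:
            best_score = score
            jbest = j
    return jbest

point = [1.0, -2.0, 0.5]
clusters = [[1.0, -2.0, 0.5], [0.0, 1.0, 1.0]]

print(find_best_cluster(point, clusters, [1.0, 1.0, 1.0]))  # 0, a valid index
print(find_best_cluster(point, clusters, [0.0, 0.0, 0.0]))  # -1, the assert case
```

Which would be consistent with a sparse imatrix (zero importance for some expert rows) tripping the assert on ffn_gate_exps.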

My setup’s a bit odd 😅 I’m running an M4 Pro MacBook with 24GB of RAM, and I’m using ik_cpp (CPU-only) because it lets me load larger models without pushing memory pressure into the yellow (and hitting swap).

Also, thanks so much for the quants. I really appreciate the work you put into them. I’m going to try out both versions and see how they perform on my end!

By the way, are you planning to release Qwen Next Instruct/Thinking quants as well? I’d definitely be interested in trying those out if you do.

Thanks, curious to hear how they work out for you; I know the smaller one is probably a touch big for 24GB. Do you have some kind of eGPU too, or is it 24GB total unified memory?

> By the way, are you planning to release Qwen Next Instruct/Thinking quants as well?

Do you specifically mean these two: https://huggingface.co/collections/Qwen/qwen3-next ?

I wasn't considering it given there are quite a few available already and I haven't heard as much talk about them, but they might be good candidates for lower RAM+VRAM systems for sure! I'll see if anyone else has done ik_llama.cpp quants for them yet and think about it.

Hi ubergarm,

https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd.it%2Fqwen3-coder-next-oddly-usable-at-aggressive-quantization-v0-q9q4nsw11rkg1.png%3Fwidth%3D3200%26format%3Dpng%26auto%3Dwebp%26s%3D5932e14267173413e275e01539ae4d848ee99077

Could you have a look here? In fact, a 32GB version that beats or matches IQ3_XXS would be great. Currently I get around 3000 t/s PP and 100-115 t/s TG with partial offloading.

@Dsturb

I assume you have a single 5090 with 32GB VRAM? Yes, these recent qwen3next models seem to hold up quite well to quantization given the right recipes. I'll take a look.

@rhinocerosowllegolas

I'm not sure how well Mac works with turboderp's exllamav3, but if it does, some of the strongest low-BPW EXL3 quants are probably here: https://huggingface.co/turboderp/Qwen3-Next-80B-A3B-Instruct-exl3

IQ3_K would be good, sitting in between the two quants provided.
I also have a 5090, getting about 1500 t/s prompt processing and 50 t/s token generation on ~40GB quants.

I'm quite happy with the IQ4_KSS you provided, getting 1200 t/s prompt processing and 50 t/s token generation on my PC. Quite usable for local coding agents. Thanks!

@Dsturb @igor255

I did some fishing and came up with something very, very tight for 32GB: smol-IQ3_KS at 30.728 GiB (3.313 BPW). I had to knock attn/ssm/shexp down to iq6_k, but those tensors are still larger than the ones the UD uses. You can see the perplexity is looking good too on the graph: https://huggingface.co/ubergarm/Qwen3-Coder-Next-GGUF/blob/main/images/perplexity.png
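As a quick sanity check on that size, average bits-per-weight times parameter count roughly predicts the file size (assuming the full ~80B parameter count of Qwen3-Coder-Next; this is an approximation since the reported BPW already averages over mixed tensor types and the file includes some metadata):

```python
# Rough GGUF size estimate from average bits-per-weight (BPW).
# n_params ~80e9 is an assumption for Qwen3-Coder-Next.

def gguf_size_gib(n_params: float, bpw: float) -> float:
    bits = n_params * bpw
    return bits / 8 / 2**30  # bits -> bytes -> GiB

print(round(gguf_size_gib(80e9, 3.313), 2))  # ~30.85, close to the 30.728 GiB reported
```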

Qwen3-Coder-Next-smol-IQ3_KS.gguf
Prompt 2142.12 t/s
Generation 79.97 t/s

Qwen3-Coder-Next-IQ4_KSS.gguf
Prompt 1280.50 t/s
Generation 50.66 t/s

My results, running with `-c 65536 --n-cpu-moe {8 or 18} --no-mmap --no-warmup -ger --merge-qkv -ub 2048 -b 2048 --jinja -ngl 99 -fa on`

Windows, Ryzen 5 5700x3d, 32gb ddr4, RTX 5090.
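For reference, those flags correspond roughly to a llama-server launch like the one below. This is only a sketch: the binary name, model path, and port are placeholders, and the right `--n-cpu-moe` value (8 or 18 above) depends on which quant is loaded; the remaining flags are taken verbatim from the list above.

```shell
# Sketch of the launch command described above (ik_llama.cpp llama-server).
# Placeholders: model path, port. --n-cpu-moe was 8 or 18 depending on the quant.
./llama-server \
  --model /models/Qwen3-Coder-Next-smol-IQ3_KS.gguf \
  -c 65536 --n-cpu-moe 8 \
  --no-mmap --no-warmup \
  -ger --merge-qkv \
  -ub 2048 -b 2048 \
  --jinja -ngl 99 -fa on \
  --port 8080
```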

And additionally tested on Linux; the extra free VRAM gives a nice boost.
Qwen3-Coder-Next-smol-IQ3_KS.gguf
Prompt 3080 t/s
Generation 108 t/s

Qwen3-Coder-Next-IQ4_KSS.gguf
Prompt 1760 t/s
Generation 72.9 t/s

Hi! @ubergarm
My system has 24 GB of unified memory, with roughly 10 GB used by the OS and apps, so about 14 GB available for inference.

For quantized models, I’m seeing:

~12 tokens/sec with IQ4_KSS
~20 tokens/sec with IQ3_KS

~64 tokens/sec PP for both.

I checked the Qwen next quants collection and couldn't find any ik_llama.cpp-based quantized versions (e.g., IQ4_KSS, IQ3_KS, etc.), which is why I reached out. I'm also pretty sure I can't run exl3 quants efficiently on CPU alone (no GPU support).

Thanks anyway for the help, really appreciate it!

Let me know if you'd like help finding compatible quants or optimizing for CPU-only inference! (Rephrased by Qwen Coder ;])
