Looking forward to IQ4_XS!

#1
by tarruda - opened

Your IQ4_XS quants for Step-3.5-Flash have been the best!

@tarruda

Thanks! I've seen you around reddit and such, thanks for sharing the good word!

I might be able to do a similar quant at iq4_xs here for the mainline folks. I generally don't do mainline and stick to ik, but I don't think anyone else does the same iq4_xs recipe that I have. I'm guessing AesSedai will likely have some solid mainline quants out tonight too.

Oh hey, have you tried pi.dev or the newer oh-my-pi agentic coding harness instead of opencode etc.? Someone sent me this blog and I've not tried it yet: http://blog.can.ac/2026/02/12/the-harness-problem/

reminds me of this great clip:
https://www.tiktok.com/@startupcode.net/video/7605360360727547150

> Oh hey, have you tried pi.dev

Funny that you ask, I just installed and tested pi.dev today for the first time.

Didn't do much with it yet, but it seems pretty good, with small initial context usage, which is great for speeding up initial responses from LLMs like Step 3.5 Flash.

Gonna have a look at the oh-my-pi fork, thanks for sharing!

@tarruda

Currently uploading IQ4_XS 114.842 GiB (4.314 BPW), compatible with both ik and mainline llama.cpp.

Can you make an IQ4_NL?

@LagOps

Heya! In general I only use ik's IQ4_NL if tensor dimensions don't work with the newer quantization types and require it.

Just curious why you would even want it? I assume most backend implementations for IQ4_NL are less optimized than similarly sized q4_0 and q4_K, even for Vulkan etc.

In my testing IQ4_NL ran a bit faster than Q4_K quants, and perplexity-wise it's a bit better than Q4_K_S with the same memory footprint. Q4_K_S/M would work for me as well.

Hey John, a million thanks for the effort, man! 🙏✌️

@LagOps

Huh, what is your inference rig setup? Full GPU? Hybrid CPU+GPU? Are you using mainline, ik, or something downstream?

> ...a bit better than Q4_K_S with the same memory footprint. Q4_K_S/M would work for me as well.

Oh, I don't use any of the "normal recipes" and only do custom. My attn.* tensors are all full q8_0, which is actually much better than the default recipes like Q4_K_S or Q4_K_M. So it isn't exactly comparable unless you benchmark it yourself to find out.
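To give a rough idea of what I mean by "custom", here's a sketch of the kind of `--custom-q` recipe I feed to ik_llama.cpp's llama-quantize; the tensor regexes, types, and file names below are just illustrative placeholders, not my exact recipe for this model:

```bash
#!/usr/bin/env bash
# Rough sketch only: the regex=type pairs are illustrative, not the exact recipe.
custom="
# Keep all attention tensors at full q8_0 (small size cost, noticeable quality win)
blk\..*\.attn_.*=q8_0
# Quantize the routed experts harder, they hold most of the weights
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_xs
blk\..*\.ffn_down_exps\.weight=iq4_xs
"
# Strip the comment lines and join the pairs with commas for --custom-q
custom=$(echo "$custom" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:,$::;s:^,::')

./build/bin/llama-quantize \
    --imatrix imatrix.dat \
    --custom-q "$custom" \
    Model-BF16.gguf Model-IQ4_XS.gguf IQ4_XS
```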

I use llama-sweep-bench and make some graphs for all my comparisons, and keep a branch here if you want to test against mainline: https://github.com/ubergarm/llama.cpp/tree/ug/port-sweep-bench
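For reference, a minimal sweep-bench run looks roughly like this; the model path, context size, offload pattern, and thread count are just placeholders for whatever rig you're testing:

```bash
# Minimal llama-sweep-bench sketch: sweeps pp/tg speed across context depth so you
# can graph and compare quants or forks. Paths and sizes are placeholders.
./build/bin/llama-sweep-bench \
    -m Model-IQ4_XS.gguf \
    -c 32768 \
    -fa \
    -ngl 99 \
    -ot exps=CPU \
    --threads 16
```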

I'll try to get some speed benchmarks up eventually, looking for a good ~75ish GiB size that could run on 96GB VRAM with ~128k context.

In limited testing the smol-IQ3_KS is working well with opencode right now so that is a good sign!

I'm happy with any custom quants as well, of course! And yes, keeping attention a bit higher is usually worth it.

I'm running mainline llama.cpp with a 7900 XTX (24 GB) and 128 GB DDR5 RAM using Vulkan (ik_llama.cpp didn't work well last time I tried it and also didn't support the new quants on Vulkan). As long as it fits my setup and has CPU-friendly quants for the routed experts, I'll gladly take it!

Just out of curiosity, is there a reason not to use IQ4_NL, at least in comparison to mainline K quants? In my own testing it was consistently the best option, and when it comes to standard recipes it was close to Q4_K_M while being the size of Q4_K_S.

@LagOps

Ahh okay, you are using an AMD GPU with the Vulkan backend. Correct, ik_llama.cpp does support Vulkan, but only for older quantization types (not the newest SOTA quants). If you're doing hybrid CPU + GPU you can make a custom quant using Vulkan-optimized types for the tensors you offload and CPU-optimized types for the CPU side. I don't release anything that specific, but you could adapt my recipes and use my imatrix if you like.
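The hybrid offload itself is the same pattern on mainline and ik as far as I know: load all layers on the GPU and then override the routed experts back onto the CPU. Roughly like this, with the path, context size, and thread count as placeholders:

```bash
# Hybrid CPU+GPU sketch: attention/dense tensors on GPU, routed experts on CPU.
# Paths, context size, and thread count are placeholders for your own rig.
./build/bin/llama-server \
    -m Model-IQ4_XS.gguf \
    -c 65536 \
    -ngl 99 \
    --override-tensor "exps=CPU" \
    --threads 16 \
    --host 127.0.0.1 --port 8080
```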

> Just out of curiosity, is there a reason not to use IQ4_NL, at least in comparison to mainline K quants? In my own testing it was consistently the best option, and when it comes to standard recipes it was close to Q4_K_M while being the size of Q4_K_S.

If you're finding iq4_nl quanted tensors beating q4_K quanted tensors, giving you better perplexity and speed for your specific offload strategy, then that is totally fine! iq4_nl is probably kind of a precursor to the later ik_llama.cpp-specific types. I'm just surprised that it has better speed optimizations for routed exps (mostly CPU inferencing), is all.

Hi UBG, before the avalanche of people requesting quants, do you think you'll make one for people with your and my setup of 48 GB of VRAM and *waves hands* RAM? I really appreciated your stepfun quant IQ4_KSS; I'm getting about 16 tps / 150 pp with it.

I know compute isn't free; if you want to drop a recipe I'll see if I can do it with my 64 GB DDR4 / 2x 3090 rig if you're tied up. Thanks again, boss!

@jpbwin

Let's see, you have 48 GB VRAM + 64 GB DDR4 = 112 GB total.

The smol-IQ3_KS 87.237 GiB (3.277 BPW) would work for you and give you plenty of context for agentic programming. I've tested it with opencode and it seems to be running pretty well.

I may consider a ~4.0 BPW smol-IQ4_KSS, but whether I release it depends on the resulting perplexity. I'll at least give it a try soon.

@ubergarm The speed increase isn't very high, about 5%, maybe a bit higher. But since KLD was slightly better as well in my testing when comparing same-size standard quants, I have come to prefer that kind of quant. I tried IQ4_XS quants as well, but those were slow for me on CPU, roughly a 20-25% speed drop. I would be happy with any kind of Q4 K-quant recipe if you feel like cooking it up. No worries if it's not a priority; I can wait for someone to do that kind of quant down the line as well.

Making my own quant is an option too, but my internet isn't the best and I'm also short on storage, so I think I'll pass this time around. It's usually worthwhile for smaller quants where some customizing can get significant gains, but in the Q4 range I'm not so picky (as long as it's a CPU-friendly quant for the experts).

@LagOps

> I tried IQ4_XS quants as well, but those were slow for me on CPU, roughly a 20-25% speed drop.

Yeah, unfortunately iq4_xs hasn't been as optimized in mainline llama.cpp for non-CUDA folks. I've heard some Mac folks complaining too, hah..

> I would be happy with any kind of Q4 K-quant recipe if you feel like cooking it up.

I'll leave that to the usual mainline suspects, bartowski and AesSedai, plus a newcomer, @ox-ox , who has some fine-looking vanilla recipes from mainline llama.cpp as well over at https://huggingface.co/ox-ox/MiniMax-M2.5-GGUF

I'll noodle more on IQ4_NL; given your description, it might interestingly be one of the most solid options for mainline Vulkan users...

Will do, thanks for the recommendation of quants from @ox-ox - I didn't know them by name and held off from downloading.

@LagOps

Yeah, I tested @ox-ox 's Q3_K_L and it seems to be working fine. You'll see slightly more TG, given the smaller attn.* reduces the active weights pulled from memory for each token generated, but at a slight cost to quality versus full q8_0 attn. It's all trade-offs in this fun game!

@ubergarm Thanks for the shoutout! I appreciate the trust.

@LagOps Welcome! I focus on standard mainline recipes to ensure maximum compatibility across backends (Metal, Vulkan, etc).

Since you have a beastly setup (128GB RAM + 24GB VRAM), you have way more headroom than my 128GB Unified Memory. My Q3_K_L will fly on your rig, but if you really want to saturate your memory with a standard Q4_K_M (~132GB), let me know and I can fire up the stove and cook one for you tonight.

@ox-ox Thanks a lot for the offer! I am a bit confused, however; a Q4_K_M is already available from you.

@LagOps

Well, who would have thunk it?! The IQ4_NL has the lowest perplexity lol... Not 100% sure why, or whether it is technically better or not (didn't check the KLD stats or anything). It is a bit bigger than the IQ4_XS too, and at 121.386 GiB (4.559 BPW) the IQ4_NL is probably too big even for a 128GB rig.

Should be done uploading any moment now!

Oh wait, if you're on Vulkan I think you'd need the mainline-compat version just because of those darn token_embd and output tensors... well, I guess I can upload the mainline-compat version too lol, why not, I already made it... one sec..

Okay, both versions are available. You should probably pick the mainline-IQ4_NL, which uses token_embd@q4_K and output@q6_K, so it's gucci for Vulkan!
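If anyone wants to roll their own mainline-compatible version, those two overrides map to mainline llama-quantize's embedding and output type flags; a rough sketch, with the file names as placeholders:

```bash
# Sketch of a mainline-compatible IQ4_NL where token_embd and output get
# Vulkan-friendly types. File names here are placeholders.
./build/bin/llama-quantize \
    --imatrix imatrix.dat \
    --token-embedding-type q4_K \
    --output-tensor-type q6_K \
    Model-BF16.gguf Model-IQ4_NL.gguf IQ4_NL
```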

@jpbwin

Well, the smol-IQ4_KSS 108.671 GiB (4.082 BPW) looks like pretty good perplexity for the size, so I decided to ship it. Probably good for folks with 128GB total and CUDA, but likely too tight for enough context on your rig.

The smaller sizes will probably have to do then, and quantizing the kv-cache with -khad -ctk q6_0 -ctv q8_0 is probably a decent trade-off.
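For the kv-cache part, that just means passing the cache type flags to llama-server. A rough sketch (q6_0 is an ik_llama.cpp-specific cache type, mainline would use q8_0/q4_0 instead; the path, context size, and thread count are placeholders):

```bash
# Sketch: quantized kv-cache on ik_llama.cpp; -fa (flash attention) is needed
# for a quantized V cache. Paths and sizes are placeholders for your rig.
./build/bin/llama-server \
    -m Model-smol-IQ3_KS.gguf \
    -c 65536 \
    -fa \
    -ctk q6_0 -ctv q8_0 \
    -ngl 99 -ot exps=CPU \
    --threads 16
```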

@ubergarm

Thanks for looking into it and for providing the quant! The fit is no issue, as there's the GPU as well; I could even push 5-10 GB more with some squeezing at 32k context. And yeah, good to see you were able to reproduce those results (at least on PPL, but KLD was slightly better for me as well). It's true that IQ4_XS is a bit smaller, but as I mentioned, it's not a CPU-friendly quant, and out of the CPU-friendly mainline quants (well, just Q4_K, really), IQ4_NL was clearly the best option in my testing.

Edit: the IQ4_NL you cooked up has some crazy good PPL. I don't think the KLD would be quite that amazing (I don't think it beats Q5_K), but still...

It might be worth increasing the quantization level for the input/output tensors in future quants, as it seems like there could be greater gains there (the ik quant version is better by a significant margin). For such a large model, spending a bit more on those tensors isn't overly costly.
