Looking forward to IQ4_XS!
Your IQ4_XS quants for Step-3.5-Flash have been the best!
Thanks! I've seen you around reddit and such, thanks for sharing the good word!
I might be able to do a similar quant at iq4_xs here for the mainline folks. I generally don't do mainline and stick to ik, but I don't think anyone else does the same iq4_xs recipe that I have. AesSedai will likely have some solid mainline quants out tonight too, I'm guessing.
Oh hey, have you tried pi.dev or the newer oh-my-pi agentic coding harness instead of opencode etc.? Someone sent me this blog and I've not tried it yet: http://blog.can.ac/2026/02/12/the-harness-problem/
reminds me of this great clip:
https://www.tiktok.com/@startupcode.net/video/7605360360727547150
> Oh hey, have you tried pi.dev
Funny that you ask, I just installed and tested pi.dev today for the first time.
Didn't do much with it yet, but it seems pretty good with a small initial context usage, which is great to speed up initial responses from LLMs like Step 3.5 Flash.
Gonna have a look at the oh-my-pi fork, thanks for sharing!
Can you make an IQ4_NL?
In my testing IQ4_NL ran a bit faster than Q4_K quants, and perplexity-wise it's a bit better than Q4_K_S at the same memory footprint. Q4_K_S/M would work for me as well.
Hei John, a million thanks for the effort, man!
Huh, what is your inference rig setup? Full GPU? Hybrid CPU+GPU? You using mainline or ik or something downstream?
> Q4_K_S with the same memory footprint. Q4_K_S/M would work for me as well.
Oh, I don't use any of the "normal recipes" and only do custom. So my attn.* tensors are actually all full q8_0, which is much better than the default recipes like Q4_K_S or Q4_K_M. So it isn't exactly comparable unless you benchmark it yourself to find out.
I use llama-sweep-bench and make some graphs for all my comparisons, and keep a branch here if you want to test against mainline: https://github.com/ubergarm/llama.cpp/tree/ug/port-sweep-bench
I'll try to get some speed benchmarks up eventually, looking for a good ~75ish GiB size that could run on 96GB VRAM with ~128k context.
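To see what BPW a ~75 GiB target implies, here's a back-of-envelope sketch. The parameter count is not official; it's inferred from the 87.237 GiB / 3.277 BPW figure quoted later in this thread, so treat it as an assumption.

```python
# Back-of-envelope: what BPW hits a ~75 GiB file for this model?
# Parameter count is inferred from the 87.237 GiB @ 3.277 BPW quant
# quoted in this thread (an assumption, not an official number).
GIB = 2**30

params = 87.237 * GIB * 8 / 3.277           # ~2.29e11 weights
target_gib = 75.0
target_bpw = target_gib * GIB * 8 / params  # bits per weight needed

headroom_gib = 96.0 - target_gib            # left for KV cache + compute buffers
print(f"~{params / 1e9:.0f}B params -> {target_bpw:.2f} BPW, "
      f"{headroom_gib:.0f} GiB headroom on a 96GB card")
```

So a ~75 GiB file lands around 2.8 BPW for a model this size, leaving roughly 21 GiB for the ~128k context and buffers.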
In limited testing the smol-IQ3_KS is working well with opencode right now so that is a good sign!
Am happy about any custom quants as well of course! And yes, keeping attention a bit higher is usually worth it.
I'm running mainline llama.cpp with a 7900xtx (24GB) and 128GB DDR5 RAM on Vulkan (ik_llama.cpp didn't work well last time I tried it, and it also didn't support the new quants on Vulkan). As long as it fits my setup and has CPU-friendly quants for the routed experts, I'll gladly take it!
Just out of curiosity, is there a reason not to use IQ4_NL? At least in comparison to mainline K quants. In my own testing it was consistently the best option, and when it comes to standard recipes it was close to Q4_K_M while being the size of Q4_K_S.
Ahh okay, you are using AMD GPU with vulkan backend. Correct, ik_llama.cpp does support Vulkan but only for older quantization types (not the newest SOTA quants). If you're doing hybrid CPU + GPU you can make a custom quant using Vulkan optimized types for offload, and CPU optimized types for CPU. I don't release anything that specific though, but you could adapt my recipes and use my imatrix if you like.
> Just out of curiosity, is there a reason not to use IQ4_NL? At least in comparison to mainline K quants. In my own testing it was consistently the best option, and when it comes to standard recipes it was close to Q4_K_M while being the size of Q4_K_S.
If you're finding iq4_nl quanted tensors beating q4_K quanted tensors and giving you better perplexity and speed for your specific offload strategy, then that is totally fine! iq4_nl is probably kind of a precursor to the later ik_llama.cpp-specific types. I'm just surprised that for routed exps (mostly CPU inferencing) it has better speed optimizations, is all.
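For anyone curious what makes IQ4_NL different from linear 4-bit grids: each block is snapped to a fixed non-linear 16-value codebook. The table below is copied from llama.cpp's `kvalues_iq4nl`; the quantizer here is a simplified nearest-neighbor sketch, not the real implementation (which also searches over scales).

```python
# Sketch of the non-linear 4-bit codebook idea behind IQ4_NL.
# Codebook values copied from llama.cpp's kvalues_iq4nl table; the
# scale search below is simplified (amax -> 127) for illustration.
import random

IQ4NL_VALUES = [-127, -104, -83, -65, -49, -35, -22, -10,
                1, 13, 25, 38, 53, 69, 89, 113]

def quantize_block(block):
    """Scale the block into codebook range, then snap each value to the
    nearest codebook entry; returns (indices, scale)."""
    amax = max(abs(x) for x in block)
    scale = amax / 127.0 if amax else 1.0
    idx = [min(range(16), key=lambda i: abs(x / scale - IQ4NL_VALUES[i]))
           for x in block]
    return idx, scale

def dequantize_block(idx, scale):
    return [IQ4NL_VALUES[i] * scale for i in idx]

random.seed(0)
block = [random.gauss(0, 1) for _ in range(32)]  # one 32-element block
idx, scale = quantize_block(block)
recon = dequantize_block(idx, scale)
rmse = (sum((a - b) ** 2 for a, b in zip(block, recon)) / 32) ** 0.5
print(f"block RMSE after 4-bit round-trip: {rmse:.4f}")
```

The codebook spacing is denser near zero, which matches the roughly gaussian distribution of weights; that's the intuition for why it can beat a uniform q4 grid at the same bit width.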
Hola UBG, before the avalanche of people requesting quants: do you think you'll make one for people with your and my setup of 48GB of VRAM and *waves hands* RAM? I really appreciated your stepfun IQ4_KSS quant; I'm getting like 16 tps / 150 pp with it.
I know compute isn't free; if you want to drop a recipe, I'll see if I can do it with my 64GB DDR4 / 2x 3090 rig if you're tied up. Thanks again, boss.
Let's see, you have 48GB VRAM + 64GB DDR4 = 112GB total.
The smol-IQ3_KS 87.237 GiB (3.277 BPW) would work for you and give you plenty of context for agentic programming. I've tested it with opencode and it seems to be running pretty well.
I may consider a ~4.0BPW smol-IQ4_KSS, but whether I release it depends on the resulting perplexity. I'll at least give it a try soon.
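A quick fit check for that 112GB rig, using the sizes quoted in this thread. The headroom still has to cover KV cache, compute buffers, and the OS, so treat the numbers loosely.

```python
# Quick fit check for the 48GB VRAM + 64GB DDR4 = 112GB rig above.
# Quant sizes are the GiB figures quoted in this thread; remaining
# headroom must also cover KV cache, compute buffers, and the OS.
total_gib = 48 + 64
quants = {
    "smol-IQ3_KS":  87.237,   # GiB
    "smol-IQ4_KSS": 108.671,  # GiB (the ~4 BPW candidate)
}
for name, size in quants.items():
    headroom = total_gib - size
    print(f"{name}: {headroom:.1f} GiB headroom")
```

The IQ3 leaves ~25 GiB free for context, while the ~4 BPW candidate leaves only ~3 GiB, which is why the smaller quant is the safer pick on this setup.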
@ubergarm The speed increase isn't very high, about 5%, maybe a bit more. But since in my testing KLD was slightly better as well when comparing same-size standard quants, I have come to prefer that kind of quant. I tried IQ4_XS quants as well, but those were slow for me on CPU, roughly a 20-25% speed drop.

I would be happy with any kind of Q4 K-quant recipe if you feel like cooking it up. No worries if it's not a priority; I can wait for someone to do that kind of quant down the line as well. Making my own quant is an option too, but my internet isn't the best and I'm also short on storage, so I think I'll pass this time around. It's usually worthwhile for smaller quants, where some customizing can get significant gains, but in the Q4 range I'm not so picky (as long as it's a CPU-friendly quant for the experts).
> I tried IQ4_XS quants as well, but those were slow for me on CPU, roughly a 20-25% speed drop.
Yeah, unfortunately iq4_xs hasn't been as optimized in mainline llama.cpp for non-CUDA folks. I've heard some Mac folks complaining too, hah...
> I would be happy with any kind of Q4 K-quant recipe if you feel like cooking it up.
I'll leave that to the usual mainline suspects, bartowski and AesSedai, and a newcomer, @ox-ox, who has some fine-looking vanilla recipes from mainline llama.cpp as well over at https://huggingface.co/ox-ox/MiniMax-M2.5-GGUF
I'll noodle more on IQ4_NL; given your description, it might interestingly be one of the most solid options for mainline Vulkan users...
@ubergarm Thanks for the shoutout! I appreciate the trust.
@LagOps Welcome! I focus on standard mainline recipes to ensure maximum compatibility across backends (Metal, Vulkan, etc).
Since you have a beastly setup (128GB RAM + 24GB VRAM), you have way more headroom than my 128GB Unified Memory. My Q3_K_L will fly on your rig, but if you really want to saturate your memory with a standard Q4_K_M (~132GB), let me know and I can fire up the stove and cook one for you tonight.
Well, who would have thunk it?! The IQ4_NL has the lowest perplexity, lol... Not 100% sure why, or whether it is technically better or not (didn't check the KLD stats or anything). It is a bit bigger than the IQ4_XS too, and at 121.386 GiB (4.559 BPW) probably too big even for a 128GB rig.
Should be done uploading any moment now!
Oh wait, if you're on Vulkan I think you'd need the mainline-compat version just because of those darn token_embd and output tensors... well, I guess I can upload the mainline-compat version too, lol, why not, I already made it... one sec...
Okay, both versions are available. You should probably pick the mainline-IQ4_NL, which uses token_embd@q4_K and output@q6_K, so it's gucci for Vulkan!
Well, the smol-IQ4_KSS 108.671 GiB (4.082 BPW) looks like pretty good perplexity for the size, so I decided to ship it. Probably good for folks with 128GB total and CUDA, but likely too tight for enough context on your rig.
The smaller sizes will probably have to do then, and quantizing the kv-cache with -khad -ctk q6_0 -ctv q8_0 is probably a decent trade-off.
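For a rough sense of what that kv-cache quantization buys, here's the bits-per-element arithmetic. The effective bit widths (including per-block scales, block size 32) are my reading of the ggml/ik_llama.cpp block layouts, so treat them as assumptions.

```python
# Rough KV-cache savings from -ctk q6_0 -ctv q8_0 vs default f16 caches.
# Effective bits per element include the per-block scale (block size 32):
# q8_0 stores 34 bytes / 32 elems = 8.5 bits; q6_0 (ik_llama.cpp) stores
# 26 bytes / 32 elems = 6.5 bits -- both figures are assumptions here.
BITS = {"f16": 16.0, "q8_0": 34 * 8 / 32, "q6_0": 26 * 8 / 32}

def kv_bits_per_elem(ctk, ctv):
    """Combined bits per (K, V) element pair for the chosen cache types."""
    return BITS[ctk] + BITS[ctv]

baseline = kv_bits_per_elem("f16", "f16")     # 32 bits per K+V pair
quantized = kv_bits_per_elem("q6_0", "q8_0")  # 15 bits per K+V pair
print(f"KV cache shrinks to {quantized / baseline:.1%} of f16")
```

So the cache ends up a bit under half the f16 size, which is what makes longer contexts viable on tight rigs.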
Thanks for looking into it and for providing the quant! The fit is no issue, as there's the GPU as well; I could even squeeze in 5-10 GB more at 32k context. And yeah, good to see you reproduce those results (at least on PPL, but KLD was slightly better for me as well). It's true that IQ4_XS is a bit smaller, but as I mentioned, it's not a CPU-friendly quant, and out of the CPU-friendly mainline quants (well, just Q4_K, really), IQ4_NL was clearly the best option in my testing.
Edit: the IQ4_NL you cooked up has some crazy good PPL. I don't think the KLD would be quite that amazing (I don't think it beats Q5_K), but still...
It might be worth increasing the quant for the input/output tensors in future quants, as it seems like there could be greater gains there (the ik quant version is better by a significant margin). For such a large model, spending a bit more on those tensors isn't overly costly.