performance of quantized models
I am looking at various quantized versions of K2.5 from unsloth, since my hardware can only hold UD-Q2_K_XL, UD-IQ3_XXS, or Q3_K_S. Is there any comparison between the quantized versions of Kimi K2.5 and other strong open-source models like Qwen3? For example, Qwen3-235B-A22B-Instruct-2507 is about 470 GB at full precision, roughly the same size as the Q3_K_S, but I am not sure whether the Q3_K_S actually performs better than Qwen.
I've been out for a week due to life stuff, but I'm hoping to run some perplexity values on the quants soon.
The best quant available is the "full size" Q4_X by AesSedai here: https://huggingface.co/AesSedai/Kimi-K2.5-GGUF, which seems too big for your desired size. Aes' smaller quants should be quite good as well, and they're compatible with mainline llama.cpp.
Hopefully I'll get some smaller ik_llama.cpp quants out soon, which will likely offer the best quality for a given footprint. You can see some earlier perplexity graphs for Kimi-K2-Thinking here: https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF#quant-collection
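For reference, perplexity numbers like the ones in those graphs are typically produced with llama.cpp's perplexity tool over a fixed text corpus (lower perplexity at a given file size means a better quant for that footprint). A minimal sketch, assuming a mainline llama.cpp build; the model path is a placeholder, and you should check your build's `--help` for the exact flags:

```shell
# Fetch the wikitext-2 test set (llama.cpp ships a helper script for this;
# verify it exists in your checkout).
./scripts/get-wikitext-2.sh

# Measure perplexity of a quant. -m path is a placeholder; a 512-token
# context is a common choice so numbers are comparable across runs.
./llama-perplexity \
  -m /models/Kimi-K2.5-Q3_K_S.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  --ctx-size 512
```

Comparing two models of similar on-disk size this way (e.g. a Kimi Q3_K_S vs. a lightly quantized Qwen3-235B) is a rough but useful proxy for the original question.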
In my own anecdotal experience, I prefer to use a more quantized larger model (e.g. DeepSeek-V3.2-Speciale or Kimi-K2.5) over a less quantized smaller model (e.g. Qwen3-235B).
> Aes' smaller quants should be quite good as well with compatibility with mainline llama.cpp
I had issues with the IQ2_XXS there, very similar to what this user is hitting with the IQ3_S: https://huggingface.co/AesSedai/Kimi-K2.5-GGUF/discussions/4
The unsloth UD-IQ2_XXS has been stable (with the q8_0 mmproj from Aes' repo), but the embedded template is dodgy, so I have to run it with the jukofyork fix here: https://huggingface.co/AesSedai/Kimi-K2.5-GGUF/discussions/1
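For anyone else working around the dodgy embedded template: it can be overridden at runtime without re-downloading the GGUF, since llama-server accepts an external jinja file. A sketch, with placeholder paths/filenames; save the fixed template from the linked discussion locally first, and check your build's `--help` for the exact flag names:

```shell
# Serve the unsloth quant with Aes' mmproj and the fixed chat template
# loaded from disk instead of the one embedded in the GGUF.
# All paths below are placeholders.
./llama-server \
  -m /models/Kimi-K2.5-UD-IQ2_XXS.gguf \
  --mmproj /models/mmproj-q8_0.gguf \
  --chat-template-file kimi-k2.5-fixed.jinja
```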
> "full size" Q4_X by AesSedai here: https://huggingface.co/AesSedai/Kimi-K2.5-GGUF
This is probably the perfect way to run it right now, and it has the fixed template baked in. I can't run it though -_-!
> some smaller ik_llama.cpp quants
Thanks for the links, I'm trying to catch up on everything that happened in the past week hah...
I have some small quants trickling in now here: https://huggingface.co/ubergarm/Kimi-K2.5-GGUF
I'll get perplexity graphs going soon.
I haven't tried the mmproj stuff at all. I did use the most recent updated official chat template, but I was having some issues with pydantic-ai tool use; it works fine with the old Kimi-K2-Thinking chat template jinja, so I still need to test that more.
All my Kimi-K2.5 quants keep the active weights at full q8_0 and only squash the routed experts, so hopefully no looping etc.
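A recipe like that (q8_0 everywhere except the routed expert tensors) can be expressed with llama-quantize's per-tensor overrides. A hedged sketch: the `--tensor-type` pattern override is a relatively recent llama-quantize feature, and the tensor-name regex and file names below are assumptions to verify against your build and the model's actual tensor names (`gguf-dump` or similar):

```shell
# Keep embeddings, output head, and (by the default type) attention and
# shared-expert weights at q8_0, while pushing only the routed expert
# FFN tensors (ffn_*_exps) down to a small quant. Paths are placeholders.
./llama-quantize \
  --token-embedding-type q8_0 \
  --output-tensor-type q8_0 \
  --tensor-type "ffn_(up|down|gate)_exps=iq3_xxs" \
  Kimi-K2.5-BF16.gguf Kimi-K2.5-q8-iq3xxs-mix.gguf q8_0
```

The idea is that the always-active weights dominate quality but are a small fraction of total size in a big MoE, so squashing only the routed experts buys most of the size reduction with less risk of degradation (looping, broken tool calls, etc.).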