IQ4_XS vs IQ4_KSS

#7
by Arkovski - opened

Hi bro,
great job with the Quants :)

But my question is: what's the difference between the KSS and the unsloth XS version?
The size difference is 111 GB vs 117 GB, so how do they compare in benchmarks and in real-world tasks/coding?

Thanks!

> But my question is: what's the difference between the KSS and the unsloth XS version?

The boring, nuanced answer is that you can look inside the quant to see each tensor using the Hugging Face GGUF browser for the UD quant, and compare against my "secret recipe" details fold here: https://huggingface.co/ubergarm/MiniMax-M2.7-GGUF#smol-iq4_kss-108671-gib-4082-bpw

If you're using AMD Strix Halo or a Mac, I'd consider checking out https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF, as those recipes are MoE-optimized similarly to mine but work on mainline llama.cpp.

If you're on full CUDA offload or hybrid CPU+CUDA, then try mine with ik_llama.cpp.
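For the hybrid CPU+CUDA case, a minimal launch sketch (the model path and context size are placeholders, and the exact flags may differ on your build of ik_llama.cpp, so check `llama-server --help`):

```shell
# Hybrid CPU+CUDA sketch for ik_llama.cpp's llama-server (paths/values are placeholders):
#   -ngl 99         offload as many layers as fit to the GPU
#   -ot "exps=CPU"  override-tensor: keep the big routed-expert tensors in system RAM
#   -fmoe           enable the fused-MoE path
./build/bin/llama-server \
  -m /models/MiniMax-M2.7-smol-IQ4_KSS.gguf \
  -c 32768 \
  -ngl 99 \
  -ot "exps=CPU" \
  -fmoe
```

The idea is that the attention/shared tensors (kept at Q8_0 in these recipes) live in VRAM where they're hit every token, while the sparse routed experts stream from RAM.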

> The size difference is 111 GB vs 117 GB, so how do they compare in benchmarks and in real-world tasks/coding?

The iq4_kss is one of ik's newest SOTA quantization types, which can squeeze the model down a little more than his older iq4_xs type. FWIW, these are only used for the routed experts anyway; I leave the rest of the model at full Q8_0, which is the "secret sauce" for MoE optimization.

Really, both will likely perform similarly; the important thing is how much kv-cache you can fit on your rig while keeping the entire thing in VRAM, etc.
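As a rough way to size that: kv-cache grows linearly with context length. A back-of-the-envelope sketch (the layer/head/dim numbers below are hypothetical placeholders, not MiniMax-M2's actual config; read the real values from the GGUF metadata):

```shell
# kv-cache bytes = 2 (K and V) * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_element
n_layers=62; n_kv_heads=8; head_dim=128; ctx=32768; bytes_per=2   # 2 bytes = f16 cache
kv_bytes=$((2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per))
echo "$((kv_bytes / 1024 / 1024)) MiB"   # prints: 7936 MiB
```

Halving the cache type to q8_0 (one byte per element) roughly halves that figure, which is often the difference between fitting and not fitting on a 24 GB card alongside the attention tensors.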
