Comparison with Q4_K from ggml-org?
Amazing IQ4_XS quants, thanks for them!
Is it possible to add https://huggingface.co/ggml-org/Step-3.5-Flash-GGUF/blob/main/Step-3.5-Flash-Q4_K.gguf to the perplexity comparison chart?
Okay, just added the ggml-org/Step-3.5-Flash-GGUF Q4_K_M 110.553 GiB (4.822 BPW) perplexity data to the graph on the model card. Surprisingly, it isn't benchmarking any better despite being larger.
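As a quick sanity check on those two figures, the file size and the bits-per-weight (BPW) together imply a total parameter count, since BPW is just total file bits divided by total weights. A minimal sketch (plain arithmetic, no GGUF parsing):

```python
# Sanity-check the quoted size/BPW pair for the Q4_K_M file:
# BPW = (file size in bits) / (parameter count), so the implied
# parameter count falls straight out of the two numbers.

def implied_param_count(size_gib: float, bpw: float) -> float:
    """Parameter count implied by a GGUF file size (GiB) and its bits-per-weight."""
    total_bits = size_gib * 2**30 * 8
    return total_bits / bpw

params = implied_param_count(110.553, 4.822)
print(f"~{params / 1e9:.0f}B parameters")  # roughly 197B
```

(The BPW figure averages over all tensors, so two quants of the same model at similar BPW can still distribute their bits very differently, which is exactly what the recipe discussion below is about.)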
I didn't look into the exact recipe it's using (likely the default Q4_K_M recipe built into llama-quantize, which was originally designed for dense models [pretty sure by ik?]). Anyway, my custom recipes tend to give larger quants to the attn/shexp/dense layers of MoEs, since those are active for every token, while the routed experts can be quantized more severely given only 8 of them are used per token. Inference could even be a little faster with smaller always-active tensors, since inferencing is usually memory-bandwidth limited.
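The bandwidth argument above can be made concrete with a rough per-token memory-traffic model: always-active tensors (attn/shexp/dense) are read in full for every token, while routed-expert weights are only read at a k/N fraction. All the numbers below are illustrative placeholders, not Step-3.5-Flash's actual tensor sizes:

```python
# Rough model of per-token weight reads for a MoE under a mixed recipe.
# Always-on tensors are read fully each token; routed experts scale by
# k_active / n_experts. Parameter counts and BPW values are hypothetical.

def active_bytes_per_token(always_on_params: float, expert_params: float,
                           n_experts: int, k_active: int,
                           bpw_always: float, bpw_experts: float) -> float:
    """Approximate bytes of weights read per generated token."""
    always_bytes = always_on_params * bpw_always / 8
    expert_bytes = expert_params * (k_active / n_experts) * bpw_experts / 8
    return always_bytes + expert_bytes

# Hypothetical split: 10B always-on params kept at 6.5 BPW, 180B routed-expert
# params over 64 experts (8 active) squeezed down to 4.25 BPW.
traffic = active_bytes_per_token(10e9, 180e9, 64, 8, 6.5, 4.25)
print(f"{traffic / 2**30:.1f} GiB read per token")
```

Because the experts only contribute at k/N, bumping their BPW up costs far less traffic per token than bumping the always-on tensors, which is why the recipe can afford bigger attn/shexp/dense quants.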
Cool!