Comparison with Q4_K from ggml-org?

#11
by tarruda - opened

Amazing IQ4_XS quants, thanks for them!

Is it possible to add https://huggingface.co/ggml-org/Step-3.5-Flash-GGUF/blob/main/Step-3.5-Flash-Q4_K.gguf to the perplexity comparison chart?

Owner

@tarruda

Yes, I saw your message on the other discussion, glad you're enjoying it! I've had a few good reviews of it now too, including for agentic use!

I did see that one, yeah. I should have time to add it to my chart today; I'll start downloading it on the remote rig.

Owner

Okay, just added the ggml-org/Step-3.5-Flash-GGUF Q4_K_M 110.553 GiB (4.822 BPW) perplexity data to the graph on the model card. Surprisingly, it isn't benchmarking any better despite being larger.
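As a quick sanity check on those numbers, file size and bits-per-weight imply a total parameter count. A minimal sketch (the helper function is my own; the implied ~197B figure is just back-of-envelope arithmetic from the size and BPW quoted above, not a claim about the model's official spec):

```python
# Back-of-envelope check relating GGUF file size (GiB) to bits per weight (BPW).
# Inputs are the 110.553 GiB / 4.822 BPW figures quoted above.

def implied_params(size_gib: float, bpw: float) -> float:
    """Total weight count implied by a file's size and average bits per weight."""
    total_bits = size_gib * 2**30 * 8  # GiB -> bits
    return total_bits / bpw

params = implied_params(110.553, 4.822)
print(f"{params / 1e9:.1f}B parameters")  # roughly 197B
```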

I didn't look into the exact recipe it's using (likely the default Q4_K_M recipe built into llama-quantize, originally designed for dense models [pretty sure by ik?])... anyway, my custom recipes tend to prefer larger attn/shexp/dense layers for MoEs, given those are always active for every token, while the routed experts can be quantized more aggressively since only 8 of them are used per token. Inference could also be a little faster with the smaller active tensor sizes, given the usual memory-bandwidth limitation of inferencing.
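The recipe idea above can be sketched as a per-tensor type map: always-active tensors (attention, shared expert) get higher precision, routed experts get less. This is purely illustrative, not my actual recipe; the name patterns follow llama.cpp's usual GGUF MoE tensor naming, and the specific type choices here are hypothetical:

```python
import re

# Illustrative per-tensor quant selection for a MoE model:
# keep always-active tensors at higher precision, squeeze the routed experts.
# First matching rule wins; patterns assume llama.cpp-style tensor names.
RECIPE = [
    (r"attn_", "Q6_K"),          # attention: active for every token
    (r"shexp", "Q6_K"),          # shared expert: also always active
    (r"ffn_.*_exps", "IQ4_XS"),  # routed experts: only a few fire per token
]

def pick_quant(tensor_name: str, default: str = "Q5_K") -> str:
    """Return the quant type for a tensor name, falling back to a default."""
    for pattern, qtype in RECIPE:
        if re.search(pattern, tensor_name):
            return qtype
    return default

print(pick_quant("blk.0.attn_q.weight"))         # Q6_K
print(pick_quant("blk.3.ffn_down_exps.weight"))  # IQ4_XS
```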

Cool!