Comparison with Q4_K from ggml-org?

#11
by tarruda - opened

Amazing IQ4_XS quants, thanks for them!

Is it possible to add https://huggingface.co/ggml-org/Step-3.5-Flash-GGUF/blob/main/Step-3.5-Flash-Q4_K.gguf to the perplexity comparison chart?

Owner

@tarruda

Yes, I saw your message on the other discussion, glad you're enjoying it! I've had a few good reviews of it now too, including for agentic use!

I did see that one, yeah. I should have time to add it to my chart today; I'll start downloading it on the remote rig.

Owner

Okay, just added the ggml-org/Step-3.5-Flash-GGUF Q4_K_M 110.553 GiB (4.822 BPW) perplexity data to the graph on the model card. Surprisingly, it isn't benchmarking any better despite being larger.
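As a quick sanity check on those numbers, file size and bits-per-weight imply a total parameter count. A minimal sketch (the helper function is my own; the implied ~197B figure is just back-of-envelope arithmetic from the size and BPW quoted above, not a claim about the model's official spec):

```python
# Back-of-envelope check relating GGUF file size (GiB) to bits per weight (BPW).
# Inputs are the 110.553 GiB / 4.822 BPW figures quoted above.

def implied_params(size_gib: float, bpw: float) -> float:
    """Total weight count implied by a file's size and average bits per weight."""
    total_bits = size_gib * 2**30 * 8  # GiB -> bits
    return total_bits / bpw

params = implied_params(110.553, 4.822)
print(f"{params / 1e9:.1f}B parameters")  # roughly 197B
```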

I didn't look into the exact recipe it's using (likely the default Q4_K_M recipe built into llama-quantize, originally designed for dense models [pretty sure by ik?])... anyway, my custom recipes tend to prefer larger attn/shexp/dense layers for MoEs, given those are always active for every token, while the routed experts can be quantized more aggressively since only 8 of them are used per token. Inference could also be a little faster with the smaller active tensor sizes, given the usual memory-bandwidth limitation of inferencing.
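The recipe idea above can be sketched as a per-tensor type map: always-active tensors (attention, shared expert) get higher precision, routed experts get less. This is purely illustrative, not my actual recipe; the name patterns follow llama.cpp's usual GGUF MoE tensor naming, and the specific type choices here are hypothetical:

```python
import re

# Illustrative per-tensor quant selection for a MoE model:
# keep always-active tensors at higher precision, squeeze the routed experts.
# First matching rule wins; patterns assume llama.cpp-style tensor names.
RECIPE = [
    (r"attn_", "Q6_K"),          # attention: active for every token
    (r"shexp", "Q6_K"),          # shared expert: also always active
    (r"ffn_.*_exps", "IQ4_XS"),  # routed experts: only a few fire per token
]

def pick_quant(tensor_name: str, default: str = "Q5_K") -> str:
    """Return the quant type for a tensor name, falling back to a default."""
    for pattern, qtype in RECIPE:
        if re.search(pattern, tensor_name):
            return qtype
    return default

print(pick_quant("blk.0.attn_q.weight"))         # Q6_K
print(pick_quant("blk.3.ffn_down_exps.weight"))  # IQ4_XS
```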

Cool!