Step-3.7-Flash quant support with the MTP GGUF models

#2
by scottgl - opened

I was wondering: have you tried quants of the Step-3.7-Flash model other than Step-3.7-Flash-Q4_K_M.gguf? I ask because I would like to run Step-3.7-Flash + MTP, but it does not fit into memory. Has anyone tried a smaller quant, such as IQ4_XS or Q4_K_S, with one of the MTP GGUF files?

Yes, MTP works with both IQ4_XS and Q4_K_S. On my Strix Halo, though, IQ4_XS is the only realistic option with context > 128k. The MTP draft KV cache is currently F16 and takes too much VRAM, and that won't change until -ctkd and -ctvd get fixed so that I can set q8_0 for the draft model cache. See https://github.com/ggml-org/llama.cpp/issues/24040
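A rough KV-cache estimate shows why the cache type matters at long context. The sketch below assumes the usual per-token KV layout (K and V, per layer, per KV head, per head dimension) and ggml's q8_0 block format (32 int8 values plus one fp16 scale, i.e. 34 bytes per 32 elements). The model dimensions are hypothetical placeholders, not Step-3.7-Flash's or its MTP head's real configuration, and the helper names are illustrative only.

```python
# Rough KV-cache size estimate: F16 vs q8_0.
# NOTE: the dimensions in __main__ are HYPOTHETICAL, not the real Step-3.7-Flash/MTP config.

F16_BYTES_PER_ELEM = 2.0
# q8_0: blocks of 32 int8 values + one fp16 scale = 34 bytes per 32 elements.
Q8_0_BYTES_PER_ELEM = 34 / 32


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes needed for the K and V caches of all layers for n_ctx tokens."""
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim  # factor 2 = K and V
    return n_ctx * elems_per_token * bytes_per_elem


def gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical draft-model dimensions, for illustration only.
    n_ctx, n_layers, n_kv_heads, head_dim = 131072, 1, 8, 128
    f16 = kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, F16_BYTES_PER_ELEM)
    q8 = kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, Q8_0_BYTES_PER_ELEM)
    print(f"F16:  {gib(f16):.3f} GiB")
    print(f"q8_0: {gib(q8):.3f} GiB ({q8 / f16:.1%} of F16)")
```

Whatever the real dimensions are, q8_0 needs about 53% of the F16 cache size, and the absolute saving grows linearly with context length.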
By the way, I like Bartowski's quants; the official Stepfun ones are good as well.
