IQ2_KS Unusable?

#2
by coughmedicine - opened

Using an ik_llamacpp pull from Feb 11th, 2026, with:
--jinja
--ctx-size 32768
-ub 2048 -b 2048
-ngl 99
--parallel 1
--threads 8

I get spelling errors, infinite repeats, etc. Very disappointing. Step Flash does the same at low quants too, yet GLM4.{6,7} is rock solid at smol-IQ1_KT. Something about the architectures of the models, maybe?

Interesting, I haven't heard other reports of spelling errors and infinite repeats.

What client are you using (the web server's built-in interface, or something like opencode or pi.dev, etc.)?

Assuming your client is set up right and you have a known-working jinja chat template, then yes, it could be that the model arch doesn't compress as well.

For example, GLM models have attn/shexp/first-N-dense layers; some other models have neither shexp nor the first N dense layers, only attn/exps.
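An arch with only attn/exps gives a quant recipe fewer small, sensitive tensors to keep at higher bits: nearly all the parameters sit in the routed experts, so they take the brunt of the squeeze. A rough sketch of the idea (tensor-name patterns follow llama.cpp's GGUF naming conventions; the `classify` helper and the bit assignments in `RECIPE` are hypothetical illustrations, not any quantizer's actual rules):

```python
import re

def classify(name: str) -> str:
    """Bucket a GGUF tensor name by the role it plays in the model."""
    if re.search(r"\battn_", name):
        return "attn"
    if re.search(r"ffn_\w+_shexp", name):
        return "shexp"       # shared expert (present in GLM-style MoE)
    if re.search(r"ffn_\w+_exps", name):
        return "exps"        # routed experts (the bulk of the parameters)
    if re.search(r"ffn_", name):
        return "dense_ffn"   # first-N dense layers, if the arch has them
    return "other"

# A recipe can then spend bits on the small sensitive tensors and squeeze
# only the huge routed-expert tensors (made-up assignments for illustration):
RECIPE = {"attn": "q5_K", "shexp": "q5_K", "dense_ffn": "q4_K",
          "exps": "iq2_ks", "other": "q8_0"}

for t in ["blk.0.attn_q.weight", "blk.0.ffn_gate_shexp.weight",
          "blk.0.ffn_gate_exps.weight", "blk.1.ffn_up.weight"]:
    print(t, "->", RECIPE[classify(t)])
```

If an arch has no shexp or dense-FFN buckets at all, everything outside attention lands in the low-bit exps bucket, which is one plausible reason some models fall apart harder at the same nominal BPW.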

fwiw I've had good luck with agentic stuff using my Step-3.5-Flash smol-IQ3_KS (75.934 GiB, 3.312 BPW) with ik and the built-in chat template.

I'm also currently cooking a new similarly sized one here: https://huggingface.co/ubergarm/MiniMax-M2.5-GGUF (though MiniMax also has only attn/exps, so we'll have to see how it does at a smaller size).

I've used SillyTavern (text and chat) and Open WebUI, and tried a bunch of different temp/top-k/etc. parameters. I didn't bother using opencode on MiMo at this low quant since it can't even chat reliably.
Step Flash (I used your Step-3.5-Flash smol-IQ3_KS) has known issues with infinite repeats when coding, even via the API, that IMO get worse with quantization; I've never had it actually finish an opencode task: https://github.com/ggml-org/llama.cpp/pull/19283#issuecomment-3870270263.
