Quant fail?
That looks like some kind of looping. What exact version are you running (e.g. ./build/bin/llama-server --version), and can you give the git hash? The working ik_llama.cpp PR was merged only ~16 hours ago: https://github.com/ikawrakow/ik_llama.cpp/pull/1240
Also, in my own testing I just use the included chat template (i.e. keep --jinja but remove --chat-template-file ...), or supposedly this one is working well for tool use: https://github.com/ikawrakow/ik_llama.cpp/pull/1240
Finally, what backend are you using (e.g. compiled with CUDA), and what GPUs? I'm assuming Linux, though your command looks more like YAML than bash?
One more idea: you can run with --validate-quants to make sure your download didn't introduce any NaNs, etc.
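Something like this, assuming you just tack the flag onto your usual server invocation (the model path here is a placeholder):

```shell
# One-off integrity check: --validate-quants scans the loaded tensors for
# NaNs/corruption from a bad download. Substitute your real GGUF path.
./build/bin/llama-server \
    -m /models/your-model.gguf \
    --validate-quants
```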
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF
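For completeness, the configure line above is normally followed by the actual compile step (standard cmake, nothing ik_llama.cpp-specific assumed):

```shell
# Build in Release mode, parallelized across all cores.
cmake --build ./build --config Release -j $(nproc)
```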
These are my build options, I might be missing some? :3 And yeah, I'm on the latest main of ik_llama.cpp.
Running CUDA on a 3090 + 3060 12GB with 192GB DDR5-5600.
nimish@jetpack-ai:~/Programs/ik_llama.cpp$ ./build/bin/llama-server --version
version: 4189 (e22b2d12)
built with cc (GCC) 15.2.1 20251211 (Red Hat 15.2.1-5) for x86_64-redhat-linux
I can run it using smol-IQ2_KS, but the long-context performance degradation is too severe.
I see one more new commit from ik regarding a missing rope_frequency. I don't think that affects my quants, but I'm still catching up.
You should be able to use -sm graph with this model, pretty sure; it's the recommended thing to try when you have multiple GPUs. Check out the example command here: https://huggingface.co/ubergarm/Step-3.5-Flash-GGUF/discussions/8
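A rough sketch of what that looks like, based on the linked discussion; the model path, context size, and ports are placeholders you'd adjust for your 3090 + 3060 split:

```shell
# -sm graph splits the compute graph across multiple GPUs (ik_llama.cpp
# feature); -ngl 99 offloads all layers. Paths/sizes are placeholders.
./build/bin/llama-server \
    -m /models/your-model.gguf \
    -c 32768 \
    -ngl 99 \
    -sm graph \
    --host 127.0.0.1 --port 8080
```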
The long-context speed drop generally depends on the memory bandwidth you have available, and I assume the drop rate would be similar across different quant sizes as well. What is your rig (CPU/RAM/GPU(s)) and full command? Have you tried benchmarking with llama-sweep-bench to visualize the drop-off? (I might do that today.) Also, if you have multiple GPUs, definitely use -sm graph.
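For reference, a hedged sketch of a sweep-bench run; llama-sweep-bench ships with ik_llama.cpp and reports prompt-processing and token-generation speed at increasing context depths, but the model path and sizes below are placeholders:

```shell
# Sweep from empty context up to 32k, all layers offloaded.
./build/bin/llama-sweep-bench \
    -m /models/your-model.gguf \
    -c 32768 \
    -ngl 99
```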
The drop-off on this model doesn't look bad on full 2x GPU offload using ik_llama.cpp's -sm graph feature and a full f16 KV cache. Are you quantizing your cache, or how are you running it?
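For intuition on why the drop tracks memory bandwidth: every generated token has to stream the whole KV cache through memory, so the per-token read cost grows linearly with context (and an f16 cache reads twice the bytes of a roughly-8-bit one). A back-of-the-envelope sketch, with entirely hypothetical model dimensions standing in for whatever your GGUF metadata actually says:

```python
# Hypothetical placeholder dimensions -- read the real values from your
# GGUF metadata; this is only a shape-of-the-curve estimate.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elt=2):
    """Bytes of K+V cache read to generate one token at context n_ctx.
    bytes_per_elt: 2 for an f16 cache, ~1 for q8_0."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt * n_ctx

def tg_ceiling(bandwidth_gbs, weight_bytes, kv_bytes):
    """Rough tokens/sec ceiling: each token streams the active weights
    plus the KV cache through memory once."""
    return bandwidth_gbs * 1e9 / (weight_bytes + kv_bytes)

# Hypothetical: 48 layers, 8 KV heads of dim 128, 10 GB of active weights,
# ~80 GB/s of usable dual-channel DDR5-5600 bandwidth.
weights = 10e9
for ctx in (4096, 32768):
    kv = kv_bytes_per_token(48, 8, 128, ctx)
    print(ctx, round(tg_ceiling(80, weights, kv), 2))
```

Even this crude model shows why a deep context hurts CPU-side inference far more than GPU (hundreds of GB/s vs ~80 GB/s), and why cache precision matters less than where the cache lives.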
The llama-server command for this was just added to the quick-start section of the model card.

