Quant fail?

#5
by nimishchaudhari - opened

[screenshot of the failing output attached]

Trying to run smol-IQ3_KS

"Step-3.5-Flash":
cmd: |
${ik_llama_cpp}
-m ${models_dir}/LLMs/Step-3.5-Flash/smol-IQ3_KS/Step-3.5-Flash-smol-IQ3_KS-00001-of-00003.gguf
--chat-template-file /home/nimish/Models/LLMs/Step-3.5-Flash/smol-IQ3_KS/step-3.5.jinja
--jinja
--temp 1.0
-ngl 99
-ot exps=CPU
-c 65565

@nimishchaudhari

That looks like some kind of looping. What exact version are you running (e.g. ./build/bin/llama-server --version), and can you give the git hash? The working ik_llama.cpp PR was merged just ~16 hours ago: https://github.com/ikawrakow/ik_llama.cpp/pull/1240

Also, in my own testing I just use the included chat template, i.e. keep --jinja but remove --chat-template-file ... Alternatively, this one is supposedly working well for tool use: https://github.com/ikawrakow/ik_llama.cpp/pull/1240
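For reference, a minimal sketch of that suggestion, reusing the paths and flags from the config above but dropping --chat-template-file so the GGUF's embedded template is used (assuming the embedded template is correct for this model):

```shell
# Same invocation as the YAML config above, as a plain shell command,
# relying on the embedded chat template (--jinja alone) instead of an
# external .jinja file.
./build/bin/llama-server \
  -m ${models_dir}/LLMs/Step-3.5-Flash/smol-IQ3_KS/Step-3.5-Flash-smol-IQ3_KS-00001-of-00003.gguf \
  --jinja \
  --temp 1.0 \
  -ngl 99 \
  -ot exps=CPU \
  -c 65565
```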

Finally, what backend are you using (e.g. compiled with CUDA), and what GPUs? I'm assuming Linux (though your command looks more like YAML than bash?).

One more idea: you can run with --validate-quants to make sure your download didn't introduce any NaNs etc.
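As a sketch, the check above would just be the same launch command with the extra flag added (assuming --validate-quants behaves as described, scanning the loaded tensors for NaNs):

```shell
# Add --validate-quants to the normal invocation to verify the
# downloaded shards weren't corrupted in transit.
./build/bin/llama-server \
  -m ${models_dir}/LLMs/Step-3.5-Flash/smol-IQ3_KS/Step-3.5-Flash-smol-IQ3_KS-00001-of-00003.gguf \
  --validate-quants
```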

cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF
These are my build options; am I missing any? :3 And yeah, I'm on the latest main of ik_llama.cpp.
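For completeness, a typical two-step CMake build with those options might look like this (the CUDA architecture value 86 is an assumption matching Ampere cards like the 3090/3060; adjust or omit it for other GPUs):

```shell
# Configure with CUDA enabled, then build; -DCMAKE_CUDA_ARCHITECTURES
# pins the target GPU arch (86 = Ampere, assumed here).
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DCMAKE_CUDA_ARCHITECTURES="86"
cmake --build ./build --config Release -j "$(nproc)"
```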

Running CUDA with a 3090 + 3060 12GB and 192GB DDR5-5600.

nimish@jetpack-ai:~/Programs/ik_llama.cpp$ ./build/bin/llama-server --version
version: 4189 (e22b2d12)
built with cc (GCC) 15.2.1 20251211 (Red Hat 15.2.1-5) for x86_64-redhat-linux

I can run it using smol-IQ2_KS, but the long-context performance degradation is too severe.

@nimishchaudhari

I see one more new commit from ik regarding a missing rope_frequency; I don't think that's an issue with my quants, but I'm still catching up.

You should be able to use -sm graph with this model, pretty sure; it's the recommended thing to try when you have multiple GPUs. Check out the example command here: https://huggingface.co/ubergarm/Step-3.5-Flash-GGUF/discussions/8
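A rough sketch of a multi-GPU launch with that flag, based on the flags already shown in this thread (this is not the exact command from the linked discussion, just an illustration):

```shell
# Multi-GPU launch using ik_llama.cpp's graph split mode (-sm graph),
# keeping the expert tensors on CPU as in the original config.
./build/bin/llama-server \
  -m ${models_dir}/LLMs/Step-3.5-Flash/smol-IQ3_KS/Step-3.5-Flash-smol-IQ3_KS-00001-of-00003.gguf \
  --jinja \
  -ngl 99 \
  -sm graph \
  -ot exps=CPU \
  -c 65565
```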

@xldistance

The long-context speed drop generally depends on the memory bandwidth you have available, and I assume the drop rate would be similar across quant sizes. What is your rig (CPU/RAM/GPU(s)) and your full command? Have you tried benchmarking with llama-sweep-bench to visualize the drop-off? (I might do that today.) Also, if you have multiple GPUs, definitely use -sm graph.
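A sketch of such a benchmark run, reusing the offload flags from earlier in the thread (exact llama-sweep-bench options may vary by build; check --help on your version):

```shell
# Sweep the context window and report prefill/decode throughput at each
# depth, to visualize the long-context slowdown.
./build/bin/llama-sweep-bench \
  -m ${models_dir}/LLMs/Step-3.5-Flash/smol-IQ3_KS/Step-3.5-Flash-smol-IQ3_KS-00001-of-00003.gguf \
  -ngl 99 \
  -ot exps=CPU \
  -c 65565
```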

@xldistance

The drop-off on this model doesn't look bad with full 2x GPU offload using ik_llama.cpp's -sm graph feature and a full f16 KV cache. Are you quantizing your cache, or how are you running it?

[llama-sweep-bench results plot for Step-3.5-Flash]

The llama-server command for this just got added to the quick start section of the model card.
