The model keeps repeating responses after running for a while - need to update to the latest llama.cpp
#1 opened by xldistance
I downloaded the ubergarm/Step-3.5-Flash-GGUF model and it runs normally, but both your GGUF quantization and the AesSedai/Step-3.5-Flash-GGUF quant I tried have issues: after running for a while they just keep repeating themselves or produce meaningless output.
The quants work beautifully.
The quants work beautifully, but you MUST update to the latest version of llama.cpp (minimum build: version 7970, b7970-eb449cdfa). The only extra option you need is --jinja:

llama-cli -m Step-3.5-Flash-PRISM-LITE-IQ2_M.gguf --jinja
You may also turn "thinking" off with this model:

llama-server -m Step-3.5-Flash-PRISM-LITE-IQ2_M.gguf --jinja --chat-template-kwargs '{"enable_thinking": false}'
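
For anyone still on an older build, here is a minimal update-and-run sketch. It assumes a source build in ~/llama.cpp and that your binaries support the --version flag; adjust paths and the model filename to your own setup:

```bash
# Fetch and build the latest llama.cpp (this issue needs build >= b7970)
git clone https://github.com/ggml-org/llama.cpp ~/llama.cpp   # or run `git pull` in an existing checkout
cd ~/llama.cpp
cmake -B build
cmake --build build --config Release -j

# Confirm the reported build number is at least 7970
./build/bin/llama-server --version

# Serve the quant with the --jinja chat template option enabled
./build/bin/llama-server -m /path/to/Step-3.5-Flash-PRISM-LITE-IQ2_M.gguf --jinja
```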
This issue was also identified upstream in llama.cpp:
https://github.com/ggml-org/llama.cpp/pull/19283#issuecomment-3870270263

