Image-Text-to-Text
Transformers
GGUF
llama.cpp
vision
multimodal
text-generation-inference
unsloth
conversational
qwen3_5
reasoning
chain-of-thought
lora
sft
agent
tool-use
function-calling
coder

Reasoning loop in llamacpp

#2
by Grandys - opened

running with the llama server and stuck in a reasoning loop. never-ending reasoning. Are there any tips for inference settings? Now I am using this command

.\llamacpp\llama-server --model "D:\ai-tools\llm\Jackrong\Qwopus3.5-9B-Coder-GGUF\Qwopus3.5-9B-coder-Exp-Q5_K_M.gguf" --mmproj "D:\ai-tools\llm\Jackrong\Qwopus3.5-9B-Coder-GGUF\mmproj.gguf" --ctx-size 131072 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --port 8001 --reasoning on -fa on --fit on --no-mmap -ctk q8_0 -ctv q8_0 --no-warmup -np 1 --prio 2 --mlock --jinja

The reasoning loop happens when I do the car wash test. Before using kv at q8, I use the default kv cache, and the reasoning loop still happens.

Have you tried this? https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates

Not yet ill try it. Thank you! btw which chat template i need to download that suitable with this model? There is a lot of variety available.

Have you tried this? https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates

Not yet ill try it. Thank you! btw which chat template i need to download that suitable with this model? There is a lot of variety available.

https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates/resolve/main/chat_template.jinja?download=true

https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates/resolve/main/chat_template.jinja?download=true

already tried it. still loop. Even funnier, the think tag got leaked outside the reasoning block, and the loop happened in the chat/response block

Yep, it's loopy

same here, never endin loop with basic command

Dont use those arguments. Test with inly temp 0.7 and without—jinja. If u use jinja use froggeric directly

lmstudio defaults - i1.iq4_xs i1.q4_k_m, q6_k - loops in simple file querry comand from hermes-agent

I had this same issue before. It only affects custom models (including this one), and it's probably a consequence of overtraining. I tried it on Qwopus and Deepseek. I switched to standard unsloth models, and I've had no issues since.
There's a chance the situation might improve after patching https://github.com/ggml-org/llama.cpp/pull/23690 on the llama.cpp side, but havent tested it yet.

I was able to use it (no loops), with llama.cpp using default arguments.

However... Qwopus3.5-9B-coder-Exp-Q4_K_M.gguf (maybe others) is using the full memory of the 24B model after unpacking & running inference.
May as well get the 24B model, as the only benefits to this smaller mod are download time & speed (it does run 2x faster).

im strugled, never ending reasoning loop. tried change temperature, kv cache type, presence penalty, repeat penalty, nothing helps. Is that useless? tried with llama.cpp and lmstudio. im hangup.

Sign up or log in to comment