Reasoning off mode issue

#30

by GergelyZsolt - opened May 27

I tried multiple Qwen models with llama.cpp in --reasoning off mode, and an orphaned </think> tag appears very often. This did not happen with other chat templates.
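
Until the template itself is fixed, one client-side workaround is to strip stray tags from the response before displaying it. The following is only a minimal sketch, not part of the template or of llama.cpp; the function name `strip_think_tags` is hypothetical, and it assumes the reasoning tags are literally `<think>` and `</think>`.

```python
import re

# Matches a complete reasoning block, including trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)
# Matches a closing tag that is left over once complete blocks are gone.
ORPHAN_CLOSER = re.compile(r"</think>\s*")


def strip_think_tags(text: str) -> str:
    """Remove complete <think>...</think> blocks, then any orphaned </think>."""
    text = THINK_BLOCK.sub("", text)
    text = ORPHAN_CLOSER.sub("", text)
    return text.lstrip()


if __name__ == "__main__":
    print(strip_think_tags("</think>\n\nHello!"))
```

This only hides the symptom in the client; the model still spends a token on the tag, so a template-level fix is preferable.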

kaotd

May 28

Same here:

config.ini:

[*]
n-gpu-layers = all
ctx-size = 65536
threads = 18
batch-size  = 2048
ubatch-size = 1024

parallel = 2
mlock = true
mmap = true
; no-mmap = true
flash-attn = true

cache-type-k = q8_0
cache-type-v = q8_0
cache-type-k-draft = q8_0
cache-type-v-draft = q8_0

reasoning = false
prio = 3
seed = 3407
jinja = true

[Qwen3.6-35B-A3B:UD-Q4_K_XL]
model = /models/Qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
mmproj = /models/Qwen3.6/mmproj-F16.gguf
temperature = 0.7
top-p = 0.8
top-k = 20
min-p = 0.0
presence-penalty = 1.5
repeat-penalty = 1.0
image-min-tokens = 1024
spec-type = draft-mtp
spec-draft-n-max = 2
chat-template-file = /templates/froggeric-chat_template-v19.jinja

open-webui v0.9.5

prompt:

**Build a VRAM and KV cache calculator tool for llama.cpp server.** The tool should include the following parameters: model type (e.g., Qwen2.5-72B), bit precision (4-bit/8-bit), total `--ctx-size`, number of `--parallel` slots, and batch size. The output should display the estimated VRAM usage, KV cache allocation per slot, and warnings if this configuration exceeds physical GPU limits to avoid 400 errors or stuttering/lag when running multiple concurrent threads.

response:

I'll build a comprehensive VRAM and KV Cache Calculator for llama.cpp server. Let me first research the current understanding of these calculations to ensure accuracy.

<function=web_search>
<parameter=search_queries>
["llama.cpp VRAM calculation KV cache formula 2024", "llama.cpp --ctx-size --parallel VRAM usage calculator", "KV cache memory calculation transformer models bits per token"]
</parameter>
</function>
</tool_call>

BebopVox

May 31

edited May 31

Any solutions?

I have this block

BebopVox

May 31

Found this one, works nice:
https://huggingface.co/spiritbuun/buun-Qwen3.6-chat_template

kaotd

Jun 1

Found this one, works nice:
https://huggingface.co/spiritbuun/buun-Qwen3.6-chat_template

Thanks, I will give it a try.

froggeric

Owner Jun 5

In the v20 release, I completely overhauled the thinking toggles and state tracking to handle reasoning-off environments better. Please try the latest v20 template and see if that cleans up the orphaned tags. If you're still seeing it, you might need to update your llama.cpp server to the latest build.

BebopVox

Jun 13

👋

In the v20 release, I completely overhauled the thinking toggles and state tracking to handle reasoning-off environments better. Please try the latest v20 template and see if that cleans up the orphaned tags. If you're still seeing it, you might need to update your llama.cpp server to the latest build.

Thank you for this!

However, I've been working with this template tonight, running the PiehSoft/Qwen3.6-40B-Deckard-MTP model overnight, and I now see these error logs:

Now let me update the test file to inline the functions.
 ⤵ 1K  ⤴ 81  cache: 95K
┌─── ✎ Write: 🟦 src/widgets/Speedometer/__tests__/Speedometer.test.ts · 1 line ──────────────────────────────────────────────────────────────────────────┐
│   1 import { describe, it, expect, vi,                                                                                                                  │
│ ✘ Diagnostics (1 error(s))                                                                                                                              │
│  └─ 🟦 src/widgets/Speedometer/__tests__/Speedometer.test.ts                                                                                            │
│    └─ ✘:1:35 '}' expected. (1005)                                                                                                                       │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
 Error: 500 Failed to parse tool call arguments as JSON: [json.exception.parse_error.101] parse error at line 1, colu… 
...
Error: Retry failed after 10 attempts: 500 Failed to parse tool call arguments as JSON: [json.exception.parse_error.101] parse error at line 1, column
 168: syntax error while parsing value - invalid string: missing closing quote; last read: '"import { describe, it, expect, vi,'

In the llama.cpp server logs I also got this:

175.27.099.097 W srv    operator(): got exception: {"error":{"code":500,"message":"Failed to parse tool call arguments as JSON: [json.exception.parse_error.101] parse error at line 1, column 168: syntax error while parsing value - invalid string: missing closing quote; last read: '\"import { describe, it, expect, vi,'","type":"server_error"}}

I'm not sure this is a template problem, but it is exactly what my assistant model highlighted to me.
It said that the following template has a probable fix, which adds the kwarg auto_disable_thinking_with_tools. Could you have a look at whether this template also needs it?

https://huggingface.co/spiritbuun/buun-Qwen3.6-chat_template
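
The parse error quoted above suggests that the tool call's arguments string ended in the middle of a string value, so the server could not decode it as JSON. The sketch below reproduces that class of failure with Python's standard `json` module; the helper name `check_tool_arguments` and the example payload (including the `path` and `content` keys) are hypothetical and only illustrate the shape of the problem, not the actual request.

```python
import json


def check_tool_arguments(raw: str):
    """Return (parsed, None) on success, or (None, message) if raw is not valid JSON."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON at column {exc.colno}: {exc.msg}"


if __name__ == "__main__":
    # Hypothetical arguments string, cut off inside the "content" value.
    truncated = '{"path": "a.test.ts", "content": "import { describe, it, expect, vi,'
    parsed, error = check_tool_arguments(truncated)
    print(error)
```

A check like this on the client side can at least distinguish a truncated generation from a template formatting problem before retrying.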

froggeric

Owner 29 days ago

Hi! The massive v21 update was just released. It completely overhauled tool-calling compatibility (switching to native Hermes JSON for inference engines like llama.cpp and LM Studio), fixed the preserve_thinking amnesia stalls/loops, and resolved several </think> parsing and prompt injection bugs.

This issue should now be fully resolved in v21. I am closing this discussion, but please feel free to reopen or create a new one if you are still experiencing any trouble. Thanks!
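
For readers unfamiliar with the Hermes style of tool calling mentioned above: it is commonly described as a JSON object with `name` and `arguments` keys wrapped in `<tool_call>` tags. The sketch below assumes that layout; the exact output of the v21 template is not shown in this thread, and the function name `extract_tool_calls` and the sample text are hypothetical.

```python
import json
import re

# Assumed layout: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Collect every well-formed JSON tool call found in text."""
    calls = []
    for match in TOOL_CALL.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            # Skip payloads that are not valid JSON (for example, truncated ones).
            continue
    return calls


if __name__ == "__main__":
    sample = (
        "Let me look that up.\n"
        '<tool_call>\n{"name": "web_search", "arguments": {"query": "kv cache"}}\n</tool_call>'
    )
    for call in extract_tool_calls(sample):
        print(call["name"], call["arguments"])
```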

froggeric changed discussion status to closed 29 days ago
