Reasoning off mode issue
#30
by GergelyZsolt - opened
I tried multiple Qwen models with llama.cpp in `--reasoning off` mode, and an orphaned `</think>` tag appears very often. This did not happen with other chat templates.
here too:
config.ini:
[*]
n-gpu-layers = all
ctx-size = 65536
threads = 18
batch-size = 2048
ubatch-size = 1024
parallel = 2
mlock = true
mmap = true
; no-mmap = true
flash-attn = true
cache-type-k = q8_0
cache-type-v = q8_0
cache-type-k-draft = q8_0
cache-type-v-draft = q8_0
reasoning = false
prio = 3
seed = 3407
jinja = true
[Qwen3.6-35B-A3B:UD-Q4_K_XL]
model = /models/Qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
mmproj = /models/Qwen3.6/mmproj-F16.gguf
temperature = 0.7
top-p = 0.8
top-k = 20
min-p = 0.0
presence-penalty = 1.5
repeat-penalty = 1.0
image-min-tokens = 1024
spec-type = draft-mtp
spec-draft-n-max = 2
chat-template-file = /templates/froggeric-chat_template-v19.jinja
open-webui v0.9.5
prompt:
**Build a VRAM and KV cache calculator tool for llama.cpp server.** The tool should include the following parameters: model type (e.g., Qwen2.5-72B), bit precision (4-bit/8-bit), total `--ctx-size`, number of `--parallel` slots, and batch size. The output should display the estimated VRAM usage, KV cache allocation per slot, and warnings if this configuration exceeds physical GPU limits to avoid 400 errors or stuttering/lag when running multiple concurrent threads.
response:
I'll build a comprehensive VRAM and KV Cache Calculator for llama.cpp server. Let me first research the current understanding of these calculations to ensure accuracy.
<function=web_search>
<parameter=search_queries>
["llama.cpp VRAM calculation KV cache formula 2024", "llama.cpp --ctx-size --parallel VRAM usage calculator", "KV cache memory calculation transformer models bits per token"]
</parameter>
</function>
</tool_call>
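For anyone trying to reproduce this, a small client-side check can flag closing tags that never had an opener, like the `</tool_call>` in the response above or the `</think>` from the original report. This is only a rough sketch: the function name `find_orphaned_closers` and the tag list are made up for illustration, and it only diagnoses the symptom in the returned text; it does not fix the template.

```python
import re

# Tag names to check. This list is an assumption for illustration;
# adjust it to whatever tags your chat template is expected to emit.
TAGS = ("think", "tool_call")


def find_orphaned_closers(text, tags=TAGS):
    """Return (tag, offset) for every closing tag with no unmatched opener before it."""
    pattern = re.compile(r"<(/?)(%s)>" % "|".join(map(re.escape, tags)))
    depth = {t: 0 for t in tags}  # currently open tags, tracked per tag name
    orphans = []
    for m in pattern.finditer(text):
        is_closing, tag = m.group(1) == "/", m.group(2)
        if not is_closing:
            depth[tag] += 1
        elif depth[tag] > 0:
            depth[tag] -= 1
        else:
            # A closer with nothing open: this is the orphan being reported.
            orphans.append((tag, m.start()))
    return orphans


# Shortened version of the response above (hypothetical sample text).
sample = "Let me research this.\n<function=web_search>\n</function>\n</tool_call>"
print(find_orphaned_closers(sample))
```

Running the same check against responses produced with different chat templates would give a quick way to compare how often each one leaks a stray tag.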
Found this one, it works nicely:
https://huggingface.co/spiritbuun/buun-Qwen3.6-chat_template
Thanks, I will give it a try.
