Instructions to use phucngodev/Qwen-Fixed-Chat-Templates with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use phucngodev/Qwen-Fixed-Chat-Templates with MLX:
# Download the model from the Hub pip install huggingface_hub[hf_xet] huggingface-cli download --local-dir Qwen-Fixed-Chat-Templates phucngodev/Qwen-Fixed-Chat-Templates
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
Fixed jinja chat templates for Qwen 3.5 & 3.6 (v19)
Update History & Changelog (v19)
2026-05-18 Update (v19): The Agentic Loop Cure. (1) Abolished "Empty Think" Poisoning: Rewrote the AST history rendering to completely remove the injection of empty
<think>\n</think>blocks. This cures a severe in-context learning bias where the model assumed tools could only be called if it didn't think first, which was causing 80%+ of premature<|im_end|>turn aborts. (2) System Prompt Logic Trap Removed: Softened the absolute tool mandate in the<IMPORTANT>block and restored Universal Synthesis instructions. The model is now explicitly permitted to transition from</think>to a conversational answer without panicking. (3) True 100% KV Cache & Amnesia Fix:preserve_thinkingnow defaults totrue. Past thoughts are retained chronologically, permanently curing "amnesia stalls" during multi-step tool loops while mathematically guaranteeing 100% KV Cache prefix matching out-of-the-box.
Update History & Changelog (v11-v18)
2026-05-16 Update (v18): Stability & Precision Patch. (1) Bulletproof False-Positive Detection: Shifted agentic error detection from broad substring matching to strict structural formats (e.g.,
"error":,Exception:,Traceback), completely curing false-positive retry loops when successful JSON returns simply contain the word "error" or "fail". (2) Legacy Engine Compatibility: Replacedloop.previtemwith explicit array indexing, fixing AST crashes on olderllama.cppandminijinjabuilds that do not track loop state items. (3) True Whitespace Normalization: Fixed a bug where reasoning bypasses and hallucinated tag recovery stacked hidden multi-newlines (\n\n\n), strictly fulfilling the 100% KV Cache hit rate claim for all edge cases. (4) Code Cleanup: Removed dead conditional branches during XML tool parsing.2026-05-15 Update (v17): Major architecture overhaul resolving edge cases in agentic tooling and KV Cache. (1) Unified Template: Consolidated Qwen 3.5 and Qwen 3.6 into a single
chat_template.jinjafile that handles all variants seamlessly. (2) Fixed "Mutually Exclusive" Stopping Bug: Changed the history-pruning logic from wiping the entire turn to safely array-slicing out just the raw tool tags (content.split('<tool_call>')[0]). This preserves the conversational text in the history, which cures the bug where the model would artificially abort its turn (output<|im_end|>) when it wanted to talk and use a tool simultaneously. (3) 100% KV Cache Hit Rate Restoration: Fully normalized internal whitespace logic (\n\n->\n) around think blocks and tool calls to exactly match the model's native autoregressive generation spacing. This perfectly synchronizes the template's rendered history with the cached generated tokens, completely eliminating the severe cache invalidation and full-prompt re-processing issues present in v16.2026-05-14 Update (v16): Four-part fix addressing community-reported regressions. (1) Native XML tool format: reverted from JSON back to the native
<function=name>/<parameter=x>format the model was trained on, restoring full compatibility with vLLM'sqwen3_coderparser and all inference engines that implement the Qwen tool protocol. (2)--reasoning offrespected in error paths: when thinking is disabled (enable_thinking=false/--reasoning off), the error escalation directives are now injected as plain text without opening any<think>block, preventing degenerate prompts in no-reasoning sessions. (3) Smarter false-positive detection: short shell command results (starting with$) and search results with timing footers (Took X.Xs) are now correctly excluded from error detection, preventing tool-retry loops when commands succeed but their output happens to contain the worderror. (4)consecutive_failurescounter no longer resets on assistant messages, allowing Tier 2 escalation to actually fire across multi-turn tool retry chains.2026-05-13 Update (v15): Three-part fix for agentic tool-loop failures. (1) Two-tier error escalation: replaced the brittle backwards-lookahead error detection with a fully forward-tracking
last_tool_failed+consecutive_failurescounter. On the first error the generation prompt is pre-seeded with a correction directive inside<think>; on the 2nd+ consecutive error the think block is bypassed and an out-of-band directive forces an immediate corrected action. (2) Length-gated detection: error signals are only read from short tool responses (< 500 chars), preventing false positives when reading code files containingerror,exception, etc. in legitimate content. (3) Static system prompt: tool instructions are now fully unconditional, permanently eliminating the KV cache invalidation vector introduced in v14.2026-05-12 Update (v14): Cured tool amnesia loops and post-tool overthinking friction! Implemented Smart Loop Preservation to dynamically scan subsequent tool returns for error markers and conditionally preserve historical reasoning context during active tool failures. Broadened the system instruction scope to define
<think>as a dual-purpose planning or synthesis space, completely eliminating indecisiveness post-tool retrieval.2026-05-11 Update (v13): Radical simplification and compatibility overhaul! Reverted tool schemas and assistant output formatting to standard JSON to natively fix downstream MCP parser crashes and C++ implicit enum coercion bugs. Removed the
ns_scanhistory loop to permanently fix KV cache invalidation mid-conversation. Replaced global string replacement for hallucinated tags with a C++ safe, localized array-slicing method to prevent data-corruption on user code blocks.2026-05-10 Update (v12): Fixed agent stalls, parameter data-loss, and hallucination bugs! Restored dynamic tool instructions and the
<IMPORTANT>formatting reminder block to stop grammar parser crashes.2026-05-10 Update (v11): Fixed agent looping and overthinking! Re-implemented
preserve_thinkingkwarg to properly strip reasoning blocks from history by default, and restored the reasoning bypass (<think>\n\n</think>\n\n).
This is a drop-in Jinja template that fixes rendering errors, KV cache invalidation, token waste, and fatal agentic stalling in the official Qwen chat templates.
It is tested to work across LM Studio, llama.cpp, vLLM, MLX, oMLX, and any engine that supports HuggingFace Jinja templates.
Why you need this
The official Qwen templates contain restrictions and Python-specific Jinja logic that break usage on many inference engines and agent frameworks.
Here are the critical issues this template fixes:
| Category | Problem | Impact | Fix |
|---|---|---|---|
| Agentic Loop | Premature Stalls (Stopping Bug) | Model aborts its turn (<|im_end|>) when trying to combine conversation and a tool call. |
Resolved the System Prompt logic trap and cured "Empty Think" poisoning (v19). |
| Agentic Loop | Retry Stall & Reasoning Spiral | Model correctly diagnoses a tool error but repeatedly emits the identical failing <tool_call>. |
Two-tier escalation: seeds <think> with correction directive; injects urgent out-of-band directive. |
| Agentic Loop | Post-Tool Overthinking | Forced <think> block prefilling causes model to panic and debate internal rules after fetching data. |
Broadened instructions to define <think> as a dual-purpose space for planning or synthesis. |
| Agentic Loop | False-Positive Error Detection | Short successful API/JSON returns containing the word error trigger false retry loops. |
Strict structural guards look for exact system failures ("error":, Traceback, etc.) instead of broad words (v18). |
| Performance | KV Cache Invalidation | History pruning dynamically mutates past turns, causing full prompt re-processing every turn. | preserve_thinking defaults to true, maintaining strict chronological rendering for a 100% KV cache hit rate (v19). |
| Performance | Empty Think Poisoning | Stripped past turns leave behind empty <think></think> tags, tricking the model into a severe in-context learning bias. |
Template completely abolishes the injection of empty think blocks (v19). |
| Compatibility | Legacy Engine Crashes | Older C++ parsing engines crash when evaluating loop.previtem. |
Uses strict chronological array indexing universally supported by all Jinja iterations (v18). |
| Compatibility | Wrong Tool Call Format | Qwen-native parsers (like vLLM's qwen3_coder) expect XML <function=name>. JSON format breaks them. |
Restored native XML format while keeping C++ safety. |
| Compatibility | Jinja C++ Crashes | Python-specific filters (map, first on strings) crash on minijinja. |
All filters replaced with universally compatible equivalents. |
| Stability | Mid-Conversation System Crash | Frameworks injecting mid-conversation steering instructions trigger a hard crash. | Native, chronological rendering for system messages anywhere in the history. |
| Stability | No-User-Query Crash | raise_exception crashes agentic loops or system-only contexts. |
Graceful fallback implemented. |
| Stability | Unclosed Thinking Before Tool | Model calls a tool without closing its reasoning, bleeding XML tags into tool parsers. | Auto-injects closing tags before tool boundaries securely. |
| Edge Cases | developer Role Rejected |
Modern APIs send the developer role; the official template rejects it. | Added full support for "developer". |
| Edge Cases | --reasoning off Ignored |
When thinking is disabled, tool error escalation still opened a <think> block, corrupting the prompt. |
Error escalation branches now fully respect enable_thinking=false. |
| Edge Cases | Reasoning Bypass Hallucinations | When thinking is disabled, Qwen models inherently hallucinate reasoning tags anyway. | Injects a safe boundary to successfully force reasoning bypass without stacking newlines (v18). |
Quick install
Choose your environment and update the template:
LM Studio
- Open your Qwen model in the right-side panel.
- Scroll down to Prompt Template.
- Replace the template with the contents of
chat_template.jinja. - Click Save.
llama.cpp / koboldcpp
--jinja --chat-template-file chat_template.jinja
vLLM
Replace the "chat_template" string in your tokenizer_config.json with the raw file contents. Use the qwen3_coder tool parser:
--tool-call-parser qwen3_coder
oMLX
Overwrite chat_template.jinja in your local model directory. Load with --jinja. Remove any chat_template_kwargs overrides because the template handles everything internally.
Which file do I use?
Both Qwen 3.5 and Qwen 3.6 variants (including 35B, 32B, 27B, and 14B parameters) have been consolidated. You only need the single chat_template.jinja file at the root of the repository.
One-line versions (chat_template_oneline.txt) are pre-minified for engines that require a single-line template string.
The thinking toggle
You can control the model reasoning behavior. Insert <|think_on|> or <|think_off|> anywhere in your system or user prompt.
The template natively intercepts the tag, removes it from the final context so the model never sees it, and flips the reasoning mode instantly.
Fast answer, no reasoning:
System: You are a coding assistant. <|think_off|>
User: What's 2+2?
Deep reasoning:
System: You are a coding assistant. <|think_on|>
User: Implement a red-black tree in Rust.
(The tag syntax uses Qwen's control-token delimiters to guarantee it will never collide with legitimate text or file paths, unlike earlier community templates that used /think)
Token Saving: Stripping past thoughts
By default in v19, this template preserves all past <think> blocks in the chat history. This is intentional: it prevents the model from suffering "amnesia stalls" during complex, multi-step agentic loops, and it mathematically guarantees a 100% Prefix KV Cache hit rate on local inference engines.
However, if you are running constrained hardware and need to save context tokens, you can explicitly disable this feature in your engine's template kwargs to automatically strip past thoughts:
{
"preserve_thinking": false
}
(Note: Setting this to false will naturally reduce your KV Cache hit rate during multi-turn chats, as the prompt string will dynamically mutate).
Technical Details of the Critical Fixes
1. The "Empty Think" Poisoning & Logic Trap Cure (v19)
Previous versions attempted to save tokens by replacing past thoughts with empty <think>\n</think> blocks, combined with an absolute system prompt demanding a tool be called immediately after </think>. This created a toxic in-context learning pattern: the model associated empty thoughts with tools, and full thoughts with forbidden conversational text, causing an 80%+ premature <|im_end|> stalling rate. v19 abolishes empty think injection and rewrites the <IMPORTANT> directives to explicitly authorize conversational synthesis after a thought block.
2. KV Cache Safety & Autoregressive Normalization (v18/v19)
Llama.cpp and vLLM utilize prefix KV caching to speed up generation. Because v19 now preserves historical thoughts chronologically by default, the rendered history perfectly synchronizes with the cached generated tokens. Combined with strict single \n normalization at autoregressive boundaries, this achieves a 100% KV Cache hit rate in multi-turn loops.
3. Native XML Tool Call Format (v16)
The model was trained with the XML-based tool call format used by Qwen3-Coder:
<tool_call>
<function=tool_name>
<parameter=param_name>
value
</parameter>
</function>
</tool_call>
v16 restored this format natively, making it compatible with all parsers while bypassing the |items crash by using C++ safe key iteration (for args_name in tool_call.arguments).
4. Two-Tier Agentic Error Escalation (v15)
When a tool call fails validation repeatedly, the model can enter a degenerate reasoning spiral. This template leverages a two-tier escalation system driven by a forward-tracked consecutive_failures counter:
- Tier 1 (1st error): Generation prompt prefix changes to seed reasoning at a different token position, breaking the cached attractor state.
- Tier 2 (2nd+ consecutive errors): Think block bypassed entirely. An urgent out-of-band directive forces an immediate corrected action wrapped safely within the user
tool_responseblock.
5. Smart False-Positive Detection (v18)
Instead of broad substring matching that triggers false retry-loops on successful database returns containing words like "error", v18 utilizes strict structural guards looking for Exception:, "error":, Traceback, and command not found, combined with length gates and shell-echo exclusions ($ ).
6. minijinja Compatibility Constraints (v18)
Python-only Jinja2 features crash on minijinja (the C++ runtime used by llama.cpp, LM Studio, and MLX). All instances have been refactored for universal support:
\| items->for key in mappingloop.previtem->messages[loop.index0 - 1](v18)map('string')->join('|')\| first->'$ ' in content
Comparison Matrix: Official vs Fixed vs Community
| Feature | Official Qwen Templates | LuffyTheFox | mod-ellary | Pneuny | This Fixed Template (v19) |
|---|---|---|---|---|---|
| Tool call format | XML (native) | JSON | JSON | JSON | XML (native, qwen3_coder compatible) |
| Tool arguments | Fails (|items) |
Fixed | Missing | Fixed | Fixed (C++ safe XML) |
| Premature Stalls (Stopping Bug) | Stalls | Stalls | Stalls | Stalls | Fixed via Logic Trap / Poisoning removal (v19) |
| Agentic Retry Stall & Reasoning Spiral | Stalls | Stalls | Stalls | Stalls | Two-tier escalation system |
| False-positive tool errors | N/A | N/A | N/A | N/A | Guarded (Strict structural matching) |
| Post-Tool Overthinking | Spams/Stalls | Broken | Broken | Broken | Universal Synthesis |
--reasoning off on tool errors |
N/A | N/A | N/A | N/A | Fully respected |
developer role |
Missing | Missing | Missing | Missing | Added |
| Thinking toggle | None | None | /think (system only) |
None | <|think_off|> anywhere |
| Empty think in history | Spams empty blocks | Broken | Tags omitted | Broken | Abolished completely (v19) |
| KV prefix caching | Breaks on dynamic history | Breaks | Breaks | Breaks | 100% stable out-of-the-box (v19) |
| Mid-conversation system | Crashes | Crashes | Crashes | Crashes | Fixed |
| No-user-query crash | Crashes | Crashes | Crashes | Crashes | Graceful fallback |
| Legacy AST support | Fails (previtem) |
Fails | Fails | Fails | Fixed (index0) |
</thinking> hallucination |
Fails | N/A | N/A | N/A | Detected and safely trimmed |
Running the test suite
python3 scripts/test_v18.py
Tests cover: XML tool format, tool instructions, thinking bypass, <|think_off|> / <|think_on|>, Tier 1 & 2 escalation, length-gated detection, shell/search false positives, --reasoning off + errors, counter reset, historical think stripping, preserve_thinking, developer role, mid-conversation system, tool response wrapping, and string argument passthrough.
Authorship
| Role | Author |
|---|---|
| Original models | Alibaba Cloud (Qwen team) |
| Template fixes | froggeric |
License
Apache-2.0, inherited from Qwen.