Fixed jinja chat templates for Qwen 3.5 & 3.6 (v19)

Update History & Changelog (v19)

2026-05-18 Update (v19): The Agentic Loop Cure. (1) Abolished "Empty Think" Poisoning: Rewrote the AST history rendering to completely remove the injection of empty <think>\n</think> blocks. This cures a severe in-context learning bias where the model assumed tools could only be called if it didn't think first, which was causing 80%+ of premature <|im_end|> turn aborts. (2) System Prompt Logic Trap Removed: Softened the absolute tool mandate in the <IMPORTANT> block and restored Universal Synthesis instructions. The model is now explicitly permitted to transition from </think> to a conversational answer without panicking. (3) True 100% KV Cache & Amnesia Fix: preserve_thinking now defaults to true. Past thoughts are retained chronologically, permanently curing "amnesia stalls" during multi-step tool loops while mathematically guaranteeing 100% KV Cache prefix matching out-of-the-box.

Update History & Changelog (v11-v18)

2026-05-16 Update (v18): Stability & Precision Patch. (1) Bulletproof False-Positive Detection: Shifted agentic error detection from broad substring matching to strict structural formats (e.g., "error":, Exception:, Traceback), completely curing false-positive retry loops when successful JSON returns simply contain the word "error" or "fail". (2) Legacy Engine Compatibility: Replaced loop.previtem with explicit array indexing, fixing AST crashes on older llama.cpp and minijinja builds that do not track loop state items. (3) True Whitespace Normalization: Fixed a bug where reasoning bypasses and hallucinated tag recovery stacked hidden multi-newlines (\n\n\n), strictly fulfilling the 100% KV Cache hit rate claim for all edge cases. (4) Code Cleanup: Removed dead conditional branches during XML tool parsing.

2026-05-15 Update (v17): Major architecture overhaul resolving edge cases in agentic tooling and KV Cache. (1) Unified Template: Consolidated Qwen 3.5 and Qwen 3.6 into a single chat_template.jinja file that handles all variants seamlessly. (2) Fixed "Mutually Exclusive" Stopping Bug: Changed the history-pruning logic from wiping the entire turn to safely array-slicing out just the raw tool tags (content.split('<tool_call>')[0]). This preserves the conversational text in the history, which cures the bug where the model would artificially abort its turn (output <|im_end|>) when it wanted to talk and use a tool simultaneously. (3) 100% KV Cache Hit Rate Restoration: Fully normalized internal whitespace logic (\n\n -> \n) around think blocks and tool calls to exactly match the model's native autoregressive generation spacing. This perfectly synchronizes the template's rendered history with the cached generated tokens, completely eliminating the severe cache invalidation and full-prompt re-processing issues present in v16.

2026-05-14 Update (v16): Four-part fix addressing community-reported regressions. (1) Native XML tool format: reverted from JSON back to the native <function=name> / <parameter=x> format the model was trained on, restoring full compatibility with vLLM's qwen3_coder parser and all inference engines that implement the Qwen tool protocol. (2) --reasoning off respected in error paths: when thinking is disabled (enable_thinking=false / --reasoning off), the error escalation directives are now injected as plain text without opening any <think> block, preventing degenerate prompts in no-reasoning sessions. (3) Smarter false-positive detection: short shell command results (starting with $ ) and search results with timing footers (Took X.Xs) are now correctly excluded from error detection, preventing tool-retry loops when commands succeed but their output happens to contain the word error. (4) consecutive_failures counter no longer resets on assistant messages, allowing Tier 2 escalation to actually fire across multi-turn tool retry chains.

2026-05-13 Update (v15): Three-part fix for agentic tool-loop failures. (1) Two-tier error escalation: replaced the brittle backwards-lookahead error detection with a fully forward-tracking last_tool_failed + consecutive_failures counter. On the first error the generation prompt is pre-seeded with a correction directive inside <think>; on the 2nd+ consecutive error the think block is bypassed and an out-of-band directive forces an immediate corrected action. (2) Length-gated detection: error signals are only read from short tool responses (< 500 chars), preventing false positives when reading code files containing error, exception, etc. in legitimate content. (3) Static system prompt: tool instructions are now fully unconditional, permanently eliminating the KV cache invalidation vector introduced in v14.

2026-05-12 Update (v14): Cured tool amnesia loops and post-tool overthinking friction! Implemented Smart Loop Preservation to dynamically scan subsequent tool returns for error markers and conditionally preserve historical reasoning context during active tool failures. Broadened the system instruction scope to define <think> as a dual-purpose planning or synthesis space, completely eliminating indecisiveness post-tool retrieval.

2026-05-11 Update (v13): Radical simplification and compatibility overhaul! Reverted tool schemas and assistant output formatting to standard JSON to natively fix downstream MCP parser crashes and C++ implicit enum coercion bugs. Removed the ns_scan history loop to permanently fix KV cache invalidation mid-conversation. Replaced global string replacement for hallucinated tags with a C++ safe, localized array-slicing method to prevent data-corruption on user code blocks.

2026-05-10 Update (v12): Fixed agent stalls, parameter data-loss, and hallucination bugs! Restored dynamic tool instructions and the <IMPORTANT> formatting reminder block to stop grammar parser crashes.

2026-05-10 Update (v11): Fixed agent looping and overthinking! Re-implemented preserve_thinking kwarg to properly strip reasoning blocks from history by default, and restored the reasoning bypass (<think>\n\n</think>\n\n).

This is a drop-in Jinja template that fixes rendering errors, KV cache invalidation, token waste, and fatal agentic stalling in the official Qwen chat templates.

It is tested to work across LM Studio, llama.cpp, vLLM, MLX, oMLX, and any engine that supports HuggingFace Jinja templates.


Why you need this

The official Qwen templates contain restrictions and Python-specific Jinja logic that break usage on many inference engines and agent frameworks.

Here are the critical issues this template fixes:

Category Problem Impact Fix
Agentic Loop Premature Stalls (Stopping Bug) Model aborts its turn (<|im_end|>) when trying to combine conversation and a tool call. Resolved the System Prompt logic trap and cured "Empty Think" poisoning (v19).
Agentic Loop Retry Stall & Reasoning Spiral Model correctly diagnoses a tool error but repeatedly emits the identical failing <tool_call>. Two-tier escalation: seeds <think> with correction directive; injects urgent out-of-band directive.
Agentic Loop Post-Tool Overthinking Forced <think> block prefilling causes model to panic and debate internal rules after fetching data. Broadened instructions to define <think> as a dual-purpose space for planning or synthesis.
Agentic Loop False-Positive Error Detection Short successful API/JSON returns containing the word error trigger false retry loops. Strict structural guards look for exact system failures ("error":, Traceback, etc.) instead of broad words (v18).
Performance KV Cache Invalidation History pruning dynamically mutates past turns, causing full prompt re-processing every turn. preserve_thinking defaults to true, maintaining strict chronological rendering for a 100% KV cache hit rate (v19).
Performance Empty Think Poisoning Stripped past turns leave behind empty <think></think> tags, tricking the model into a severe in-context learning bias. Template completely abolishes the injection of empty think blocks (v19).
Compatibility Legacy Engine Crashes Older C++ parsing engines crash when evaluating loop.previtem. Uses strict chronological array indexing universally supported by all Jinja iterations (v18).
Compatibility Wrong Tool Call Format Qwen-native parsers (like vLLM's qwen3_coder) expect XML <function=name>. JSON format breaks them. Restored native XML format while keeping C++ safety.
Compatibility Jinja C++ Crashes Python-specific filters (map, first on strings) crash on minijinja. All filters replaced with universally compatible equivalents.
Stability Mid-Conversation System Crash Frameworks injecting mid-conversation steering instructions trigger a hard crash. Native, chronological rendering for system messages anywhere in the history.
Stability No-User-Query Crash raise_exception crashes agentic loops or system-only contexts. Graceful fallback implemented.
Stability Unclosed Thinking Before Tool Model calls a tool without closing its reasoning, bleeding XML tags into tool parsers. Auto-injects closing tags before tool boundaries securely.
Edge Cases developer Role Rejected Modern APIs send the developer role; the official template rejects it. Added full support for "developer".
Edge Cases --reasoning off Ignored When thinking is disabled, tool error escalation still opened a <think> block, corrupting the prompt. Error escalation branches now fully respect enable_thinking=false.
Edge Cases Reasoning Bypass Hallucinations When thinking is disabled, Qwen models inherently hallucinate reasoning tags anyway. Injects a safe boundary to successfully force reasoning bypass without stacking newlines (v18).

Quick install

Choose your environment and update the template:

LM Studio

  1. Open your Qwen model in the right-side panel.
  2. Scroll down to Prompt Template.
  3. Replace the template with the contents of chat_template.jinja.
  4. Click Save.

llama.cpp / koboldcpp

--jinja --chat-template-file chat_template.jinja

vLLM

Replace the "chat_template" string in your tokenizer_config.json with the raw file contents. Use the qwen3_coder tool parser:

--tool-call-parser qwen3_coder

oMLX

Overwrite chat_template.jinja in your local model directory. Load with --jinja. Remove any chat_template_kwargs overrides because the template handles everything internally.


Which file do I use?

Both Qwen 3.5 and Qwen 3.6 variants (including 35B, 32B, 27B, and 14B parameters) have been consolidated. You only need the single chat_template.jinja file at the root of the repository.

One-line versions (chat_template_oneline.txt) are pre-minified for engines that require a single-line template string.


The thinking toggle

You can control the model reasoning behavior. Insert <|think_on|> or <|think_off|> anywhere in your system or user prompt.

The template natively intercepts the tag, removes it from the final context so the model never sees it, and flips the reasoning mode instantly.

Fast answer, no reasoning:

System: You are a coding assistant. <|think_off|>
User: What's 2+2?

Deep reasoning:

System: You are a coding assistant. <|think_on|>
User: Implement a red-black tree in Rust.

(The tag syntax uses Qwen's control-token delimiters to guarantee it will never collide with legitimate text or file paths, unlike earlier community templates that used /think)


Token Saving: Stripping past thoughts

By default in v19, this template preserves all past <think> blocks in the chat history. This is intentional: it prevents the model from suffering "amnesia stalls" during complex, multi-step agentic loops, and it mathematically guarantees a 100% Prefix KV Cache hit rate on local inference engines.

However, if you are running constrained hardware and need to save context tokens, you can explicitly disable this feature in your engine's template kwargs to automatically strip past thoughts:

{
  "preserve_thinking": false
}

(Note: Setting this to false will naturally reduce your KV Cache hit rate during multi-turn chats, as the prompt string will dynamically mutate).


Technical Details of the Critical Fixes

1. The "Empty Think" Poisoning & Logic Trap Cure (v19)

Previous versions attempted to save tokens by replacing past thoughts with empty <think>\n</think> blocks, combined with an absolute system prompt demanding a tool be called immediately after </think>. This created a toxic in-context learning pattern: the model associated empty thoughts with tools, and full thoughts with forbidden conversational text, causing an 80%+ premature <|im_end|> stalling rate. v19 abolishes empty think injection and rewrites the <IMPORTANT> directives to explicitly authorize conversational synthesis after a thought block.

2. KV Cache Safety & Autoregressive Normalization (v18/v19)

Llama.cpp and vLLM utilize prefix KV caching to speed up generation. Because v19 now preserves historical thoughts chronologically by default, the rendered history perfectly synchronizes with the cached generated tokens. Combined with strict single \n normalization at autoregressive boundaries, this achieves a 100% KV Cache hit rate in multi-turn loops.

3. Native XML Tool Call Format (v16)

The model was trained with the XML-based tool call format used by Qwen3-Coder:

<tool_call>
<function=tool_name>
<parameter=param_name>
value
</parameter>
</function>
</tool_call>

v16 restored this format natively, making it compatible with all parsers while bypassing the |items crash by using C++ safe key iteration (for args_name in tool_call.arguments).

4. Two-Tier Agentic Error Escalation (v15)

When a tool call fails validation repeatedly, the model can enter a degenerate reasoning spiral. This template leverages a two-tier escalation system driven by a forward-tracked consecutive_failures counter:

  • Tier 1 (1st error): Generation prompt prefix changes to seed reasoning at a different token position, breaking the cached attractor state.
  • Tier 2 (2nd+ consecutive errors): Think block bypassed entirely. An urgent out-of-band directive forces an immediate corrected action wrapped safely within the user tool_response block.

5. Smart False-Positive Detection (v18)

Instead of broad substring matching that triggers false retry-loops on successful database returns containing words like "error", v18 utilizes strict structural guards looking for Exception:, "error":, Traceback, and command not found, combined with length gates and shell-echo exclusions ($ ).

6. minijinja Compatibility Constraints (v18)

Python-only Jinja2 features crash on minijinja (the C++ runtime used by llama.cpp, LM Studio, and MLX). All instances have been refactored for universal support:

  • \| items -> for key in mapping
  • loop.previtem -> messages[loop.index0 - 1] (v18)
  • map('string') -> join('|')
  • \| first -> '$ ' in content
Comparison Matrix: Official vs Fixed vs Community
Feature Official Qwen Templates LuffyTheFox mod-ellary Pneuny This Fixed Template (v19)
Tool call format XML (native) JSON JSON JSON XML (native, qwen3_coder compatible)
Tool arguments Fails (|items) Fixed Missing Fixed Fixed (C++ safe XML)
Premature Stalls (Stopping Bug) Stalls Stalls Stalls Stalls Fixed via Logic Trap / Poisoning removal (v19)
Agentic Retry Stall & Reasoning Spiral Stalls Stalls Stalls Stalls Two-tier escalation system
False-positive tool errors N/A N/A N/A N/A Guarded (Strict structural matching)
Post-Tool Overthinking Spams/Stalls Broken Broken Broken Universal Synthesis
--reasoning off on tool errors N/A N/A N/A N/A Fully respected
developer role Missing Missing Missing Missing Added
Thinking toggle None None /think (system only) None <|think_off|> anywhere
Empty think in history Spams empty blocks Broken Tags omitted Broken Abolished completely (v19)
KV prefix caching Breaks on dynamic history Breaks Breaks Breaks 100% stable out-of-the-box (v19)
Mid-conversation system Crashes Crashes Crashes Crashes Fixed
No-user-query crash Crashes Crashes Crashes Crashes Graceful fallback
Legacy AST support Fails (previtem) Fails Fails Fails Fixed (index0)
</thinking> hallucination Fails N/A N/A N/A Detected and safely trimmed

Running the test suite

python3 scripts/test_v18.py

Tests cover: XML tool format, tool instructions, thinking bypass, <|think_off|> / <|think_on|>, Tier 1 & 2 escalation, length-gated detection, shell/search false positives, --reasoning off + errors, counter reset, historical think stripping, preserve_thinking, developer role, mid-conversation system, tool response wrapping, and string argument passthrough.


Authorship

Role Author
Original models Alibaba Cloud (Qwen team)
Template fixes froggeric

License

Apache-2.0, inherited from Qwen.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support