[Bug/Optimization] Inconsistent whitespace control in `chat_template.jinja` breaking Radix Cache / Prefix Caching
There is a whitespace handling inconsistency in the `chat_template.jinja` for GLM-5. Some control blocks lack explicit whitespace strippers (`{%-` and `-%}`), making the rendered output highly dependent on specific Jinja2 environment settings (`trim_blocks` and `lstrip_blocks`).
When these settings are not strictly enforced by the inference backend (or in custom implementations), the template injects redundant newlines and spaces that accumulate based on the number of messages in the conversation history.
Impact on Radix Cache (Prefix Caching)
Radix caching relies on stable token ID sequences. Because the whitespace changes depending on the message count, the resulting tokens for the prompt prefix diverge:
- 1-turn Conversation: `[gMASK]<sop>\n \n<|user|>` → Tokenized as ID 8942
- 3-turn Conversation: `[gMASK]<sop>\n \n \n<|user|>` → Tokenized as ID 78496
This divergence at the beginning of the sequence causes a Cache Miss. The system is forced to re-compute the KV Cache (Prefill) for every turn, significantly increasing Time To First Token (TTFT) and inference costs in production environments (e.g., vLLM, SGLang).
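Why a single early mismatch is so costly can be sketched with a longest-shared-prefix check, which is roughly what a prefix cache evaluates when deciding how much KV state it can reuse. This is only an illustration, not how vLLM or SGLang implement matching. The IDs 8942 and 78496 come from the example above; every other token ID is a placeholder assumed for the sketch.

```python
def shared_prefix_len(a, b):
    """Return the number of leading token IDs that two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Placeholder IDs (assumed): 1 = [gMASK], 2 = <sop>, 100+ = message tokens.
one_turn = [1, 2, 8942, 100, 101, 102]
three_turn = [1, 2, 78496, 100, 101, 102, 103, 104]

reusable = shared_prefix_len(one_turn, three_turn)
print(reusable)  # 2: only [gMASK] and <sop> match
```

Although the first-turn message tokens (100 to 102) are identical in both sequences, they sit after the mismatch and so cannot be served from a prefix-keyed cache.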
Root Cause Analysis
In the current chat_template.jinja:
Line 35: {% set ns.last_user_index = loop.index0 -%} {# Missing left stripper - #}
...
Line 38: {% for m in messages %} {# Missing both strippers - #}
- Line 35: Lacks `{%-`. Without global `lstrip_blocks`, it preserves the 8-space indentation.
- Line 38: Lacks `{%-` and `-%}`. Without global `trim_blocks`, it preserves the trailing newline.
While some loaders (like transformers) enable these flags by default, many optimized C++ backends or custom scripts do not, leading to "Hidden Token Drift."
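The effect of each flag can be seen in isolation with two tiny templates. These snippets are illustrative stand-ins written for this report, not excerpts from the GLM-5 template; the variable names are assumptions.

```python
from jinja2 import Environment

# A tag indented by 8 spaces with no left stripper (like Line 35).
indented = "A\n        {% set x = 1 -%}\nB"
# A loop tag with no strippers at all (like Line 38).
looped = "{% for m in items %}\n{{ m }}{% endfor %}"

default_env = Environment()
lstrip_env = Environment(lstrip_blocks=True)
trim_env = Environment(trim_blocks=True)

indent_default = default_env.from_string(indented).render()
indent_lstrip = lstrip_env.from_string(indented).render()
loop_default = default_env.from_string(looped).render(items=[1, 2])
loop_trim = trim_env.from_string(looped).render(items=[1, 2])

print(repr(indent_default))  # 'A\n        B' (indentation kept)
print(repr(indent_lstrip))   # 'A\nB'
print(repr(loop_default))    # '\n1\n2' (one newline per iteration)
print(repr(loop_trim))       # '12'
```

The loop case shows why the drift scales with conversation length: every iteration re-emits the unstripped newline.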
Suggested Fix
Apply Defensive Programming by explicitly stripping whitespace within the template to ensure environment-agnostic output:
{# Suggested Change for Line 35 #}
{%- set ns.last_user_index = loop.index0 -%}
{# Suggested Change for Line 38 #}
{%- for m in messages -%}
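One way to check that explicit strippers make the output environment-agnostic is to render the same template under a default and a strict Jinja2 environment and compare the results. The template below is a simplified stand-in in which every block tag carries both markers; it is not the actual `chat_template.jinja`, and its structure and the sample messages are assumptions.

```python
from jinja2 import Environment

# Hypothetical stand-in template with explicit strippers on every block tag.
FIXED = (
    "[gMASK]<sop>\n"
    "{%- set ns = namespace(last_user_index=-1) -%}\n"
    "{%- for m in messages -%}\n"
    "    {%- if m.role == 'user' -%}\n"
    "        {%- set ns.last_user_index = loop.index0 -%}\n"
    "    {%- endif -%}\n"
    "{%- endfor -%}\n"
    "{%- for m in messages -%}\n"
    "<|{{ m.role }}|>{{ m.content }}\n"
    "{%- endfor -%}"
)

one = [{"role": "user", "content": "hi"}]
three = one + [
    {"role": "assistant", "content": "hello"},
    {"role": "user", "content": "bye"},
]

default_env = Environment()
strict_env = Environment(trim_blocks=True, lstrip_blocks=True)

for msgs in (one, three):
    a = default_env.from_string(FIXED).render(messages=msgs)
    b = strict_env.from_string(FIXED).render(messages=msgs)
    print(a == b, repr(a))
```

With the strippers in place, the 3-turn rendering begins with the full 1-turn rendering, so the shared prefix stays stable across turns.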
Steps to Reproduce
Render the template using a standard Jinja2 environment without `trim_blocks=True` or `lstrip_blocks=True` for conversations with different numbers of messages, and observe that the number of tokens between `<sop>` and `<|user|>` varies.
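A minimal reproduction sketch follows. Because the full template is not quoted here, it uses a simplified stand-in that mimics the unstripped `set` tag inside a loop; the template text, the `gap` helper, and the sample messages are all assumptions. It measures characters rather than tokens, so a tokenizer would be needed to confirm the exact IDs.

```python
from jinja2 import Environment

# Hypothetical stand-in: the indented set tag lacks a left stripper.
BUGGY = (
    "[gMASK]<sop>\n"
    "{%- set ns = namespace(last_user_index=-1) -%}\n"
    "{%- for m in messages %}\n"
    "    {%- if m.role == 'user' %}\n"
    "        {% set ns.last_user_index = loop.index0 -%}\n"
    "    {%- endif %}\n"
    "{%- endfor %}\n"
    "{%- for m in messages -%}\n"
    "<|{{ m.role }}|>{{ m.content }}\n"
    "{%- endfor -%}"
)


def gap(rendered):
    """Return the text between <sop> and the first <|user|> marker."""
    start = rendered.index("<sop>") + len("<sop>")
    return rendered[start:rendered.index("<|user|>")]


one = [{"role": "user", "content": "hi"}]
three = one + [
    {"role": "assistant", "content": "hello"},
    {"role": "user", "content": "bye"},
]

default_env = Environment()
strict_env = Environment(trim_blocks=True, lstrip_blocks=True)

for name, env in (("default", default_env), ("strict", strict_env)):
    tmpl = env.from_string(BUGGY)
    g1 = gap(tmpl.render(messages=one))
    g3 = gap(tmpl.render(messages=three))
    print(name, repr(g1), repr(g3))
```

Under the default environment the gap grows with the number of user messages, while the strict environment yields an empty gap in both cases.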
Thank you for your feedback. The issue has been reproduced and the template will be updated.