[Bug/Optimization] Inconsistent whitespace control in `chat_template.jinja` breaking Radix Cache / Prefix Caching

#61
by JustinTong - opened

There is a whitespace handling inconsistency in the chat_template.jinja for GLM-5. Some control blocks lack explicit whitespace strippers ({%- and -%}), making the rendered output highly dependent on specific Jinja2 environment settings (trim_blocks and lstrip_blocks).

When these settings are not strictly enforced by the inference backend (or in custom implementations), the template injects redundant newlines and spaces that accumulate based on the number of messages in the conversation history.

Impact on Radix Cache (Prefix Caching)

Radix caching relies on stable token ID sequences. Because the whitespace changes depending on the message count, the resulting tokens for the prompt prefix diverge:

  • 1-turn Conversation: [gMASK]<sop>\n \n<|user|> → Tokenized as ID 8942
  • 3-turn Conversation: [gMASK]<sop>\n \n \n<|user|> → Tokenized as ID 78496

Because the divergence occurs at the very beginning of the sequence, almost none of the cached prefix can be reused, and every turn results in a cache miss. The backend (e.g., vLLM, SGLang) is forced to recompute the KV cache for the whole prompt during prefill, which significantly increases Time To First Token (TTFT) and inference cost in production environments.
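To make the cost concrete, here is a minimal Python sketch of how much of a cached prompt stays reusable. It assumes a prefix cache reuses exactly the longest run of leading token IDs that two prompts share. `shared_prefix_len` is a hypothetical helper, and every ID except the 8942 and 78496 quoted above is a placeholder.

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[int], incoming: Sequence[int]) -> int:
    """Count the leading token IDs that two prompts have in common."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


# Placeholder IDs, except 8942 and 78496, which are the IDs from the report.
HEAD = [101, 102]          # stands in for [gMASK]<sop>
TURN_1 = [201, 202, 203]   # stands in for the first user message
LATER_TURNS = [301, 302]   # stands in for the messages added afterwards

cached_prompt = HEAD + [8942] + TURN_1
drifted_prompt = HEAD + [78496] + TURN_1 + LATER_TURNS
stable_prompt = HEAD + [8942] + TURN_1 + LATER_TURNS

drifted_reuse = shared_prefix_len(cached_prompt, drifted_prompt)
stable_reuse = shared_prefix_len(cached_prompt, stable_prompt)
print(f"reusable tokens with drift: {drifted_reuse} of {len(cached_prompt)}")
print(f"reusable tokens when stable: {stable_reuse} of {len(cached_prompt)}")
```

In this toy setup the drifted prompt shares only the two head tokens with the cached one, while the stable prompt shares the entire earlier prompt.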

Root Cause Analysis

In the current chat_template.jinja:

Line 35:    {% set ns.last_user_index = loop.index0 -%}  {# Missing left stripper - #}
...
Line 38:    {% for m in messages %}                     {# Missing both strippers - #}

  • Line 35: Lacks {%-. Without global lstrip_blocks, it preserves the 8-space indentation.
  • Line 38: Lacks {%- and -%}. Without global trim_blocks, it preserves the trailing newline.

While some loaders (like transformers) enable these flags by default, many optimized C++ backends or custom scripts do not, leading to "Hidden Token Drift."
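The effect of each flag can be seen in isolation with a small Python sketch. The template below is a toy snippet, not the GLM-5 template: it only places an indented block tag on its own line, which is the pattern described for Lines 35 and 38. `Environment`, `trim_blocks`, and `lstrip_blocks` are the standard Jinja2 names; `SOURCE` and `render` are made up for illustration.

```python
from jinja2 import Environment

# Toy template (an assumption, not the real chat template):
# an indented block tag that sits on a line of its own.
SOURCE = "a\n    {% if true %}\nb{% endif %}"


def render(trim_blocks: bool, lstrip_blocks: bool) -> str:
    """Render SOURCE under the given whitespace settings."""
    env = Environment(trim_blocks=trim_blocks, lstrip_blocks=lstrip_blocks)
    return env.from_string(SOURCE).render()


for trim in (False, True):
    for lstrip in (False, True):
        out = render(trim, lstrip)
        print(f"trim_blocks={trim}, lstrip_blocks={lstrip}: {out!r}")
```

Only the combination with both flags enabled removes the indentation and the newline around the tag; every other combination leaves at least one of them in the output.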

Suggested Fix

Strip whitespace explicitly inside the template, as a defensive measure, so that the rendered output no longer depends on the environment's settings:

{# Suggested Change for Line 35 #}
{%- set ns.last_user_index = loop.index0 -%}

{# Suggested Change for Line 38 #}
{%- for m in messages -%}

Steps to Reproduce

Render the template using a standard Jinja2 environment without trim_blocks=True or lstrip_blocks=True for conversations with different numbers of messages, and observe that the amount of whitespace (and therefore the tokens) between <sop> and <|user|> varies.
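A reproduction along these lines might look like the following Python sketch. It does not load the real chat_template.jinja: `LOOSE` and `STRICT` are simplified stand-ins that copy only the pattern from Lines 35 and 38 (with and without the suggested strippers), and `render` is a hypothetical helper. The exact whitespace it prints will therefore differ from what the real template produces.

```python
from jinja2 import Environment

# Simplified stand-in for the current template (an assumption).
LOOSE = (
    "[gMASK]<sop>\n"
    "{%- set ns = namespace(last_user_index=-1) %}\n"
    "{% for m in messages %}\n"
    "    {% set ns.last_user_index = loop.index0 -%}\n"
    "{% endfor %}\n"
    "<|user|>"
)

# Same stand-in with explicit strippers on every block tag.
STRICT = (
    "[gMASK]<sop>\n"
    "{%- set ns = namespace(last_user_index=-1) -%}\n"
    "{%- for m in messages -%}\n"
    "    {%- set ns.last_user_index = loop.index0 -%}\n"
    "{%- endfor -%}\n"
    "<|user|>"
)


def render(source: str, turns: int, **env_options) -> str:
    """Render a template for a conversation with `turns` messages."""
    messages = [{"role": "user", "content": f"msg {i}"} for i in range(turns)]
    env = Environment(**env_options)
    return env.from_string(source).render(messages=messages)


for turns in (1, 3):
    print(turns, "loose, default env: ", repr(render(LOOSE, turns)))
    print(turns, "loose, flags on:    ",
          repr(render(LOOSE, turns, trim_blocks=True, lstrip_blocks=True)))
    print(turns, "strict, default env:", repr(render(STRICT, turns)))
```

With the default environment, the loose variant emits extra whitespace for each message, so the prefix changes with the conversation length. The strict variant renders the same prefix regardless of the flags or the number of messages.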

Thank you for your feedback. The issue has been reproduced, and the template will be updated.

ZHANGYUXUAN-zR changed discussion status to closed
