Instructions to use zai-org/GLM-5 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use zai-org/GLM-5 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="zai-org/GLM-5")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-5")
model = AutoModelForCausalLM.from_pretrained("zai-org/GLM-5")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use zai-org/GLM-5 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "zai-org/GLM-5"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-5",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
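
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server started above is listening on `localhost:8000`. The helper names `build_chat_request` and `chat` are hypothetical, introduced here for illustration. The response is read using the OpenAI chat-completions layout (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000", model="zai-org/GLM-5"):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Requires a running server; not called at import time.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request does not need a live server.
request = build_chat_request("What is the capital of France?")
```

With the server running, `print(chat("What is the capital of France?"))` should print the model's reply. For the SGLang server, pass `base_url="http://localhost:30000"`.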

Use Docker

docker model run hf.co/zai-org/GLM-5

SGLang

How to use zai-org/GLM-5 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "zai-org/GLM-5" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-5",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "zai-org/GLM-5" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-5",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use zai-org/GLM-5 with Docker Model Runner:
```
docker model run hf.co/zai-org/GLM-5
```

[Bug/Optimization] Inconsistent whitespace control in `chat_template.jinja` breaking Radix Cache / Prefix Caching

#61

by JustinTong - opened Mar 9

JustinTong

Mar 9

There is a whitespace handling inconsistency in the chat_template.jinja for GLM-5. Some control blocks lack explicit whitespace strippers ({%- and -%}), making the rendered output highly dependent on specific Jinja2 environment settings (trim_blocks and lstrip_blocks).

When these settings are not strictly enforced by the inference backend (or in custom implementations), the template injects redundant newlines and spaces that accumulate based on the number of messages in the conversation history.

Impact on Radix Cache (Prefix Caching)

Radix caching relies on stable token ID sequences. Because the whitespace changes depending on the message count, the resulting tokens for the prompt prefix diverge:

1-turn conversation: [gMASK]<sop>\n \n<|user|> → tokenized as ID 8942
3-turn conversation: [gMASK]<sop>\n \n \n<|user|> → tokenized as ID 78496

Because the divergence occurs right at the beginning of the sequence, almost none of the cached prefix can be reused, so every turn is a cache miss. The system is forced to recompute the KV cache (a full prefill) on each turn, which significantly increases Time To First Token (TTFT) and inference cost in production environments (e.g., vLLM, SGLang).
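
To make the effect concrete, here is a minimal sketch of how much of a cached prompt survives when the sequences diverge early. It is a simplification, not how vLLM or SGLang implement their caches. Only the IDs 8942 and 78496 come from the report; every other token ID and the helper `shared_prefix_len` are hypothetical.

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[int], new: Sequence[int]) -> int:
    """Count the leading token IDs that two prompts have in common."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


# Hypothetical IDs for the special tokens and message content.
GMASK, SOP, USER = 1, 2, 3
turn1_tokens = [GMASK, SOP, 8942, USER, 100, 101]

# Whitespace drift: the token after <sop> changes with the message count.
drifting_turn3 = [GMASK, SOP, 78496, USER, 100, 101, 200, 201]
# Stable template: the earlier prompt is an exact prefix of the later one.
stable_turn3 = [GMASK, SOP, 8942, USER, 100, 101, 200, 201]

drifting_reuse = shared_prefix_len(turn1_tokens, drifting_turn3)
stable_reuse = shared_prefix_len(turn1_tokens, stable_turn3)
print(drifting_reuse, stable_reuse)
```

In this toy setup the drifting prompt shares only the two leading special tokens with the cached one, while the stable prompt reuses all six cached tokens.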

Root Cause Analysis

In the current chat_template.jinja:

Line 35:    {% set ns.last_user_index = loop.index0 -%}  {# Missing left stripper - #}
...
Line 38:    {% for m in messages %}                     {# Missing both strippers - #}

Line 35: Lacks {%-. Without global lstrip_blocks, it preserves the 8-space indentation.
Line 38: Lacks {%- and -%}. Without global trim_blocks, it preserves the trailing newline.

While some loaders (like transformers) enable these flags by default, many optimized C++ backends or custom scripts do not, leading to "Hidden Token Drift."
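
The dependence on environment flags can be illustrated with a toy template. This is a sketch assuming the `jinja2` package is installed. `TOY_TEMPLATE` only imitates the shape of the two reported lines and is not the real GLM-5 template. The helper `render` is hypothetical.

```python
from jinja2 import Environment

# Toy stand-in for the reported lines; NOT the real GLM-5 chat template.
TOY_TEMPLATE = (
    "[gMASK]<sop>\n"
    "{%- set ns = namespace(last_user_index=-1) -%}\n"
    "{% for m in messages %}\n"
    "    {% set ns.last_user_index = loop.index0 -%}\n"
    "{% endfor %}\n"
    "<|user|>"
)


def render(n_messages: int, strict: bool) -> str:
    """Render the toy template with or without trim_blocks/lstrip_blocks."""
    env = Environment(trim_blocks=strict, lstrip_blocks=strict)
    messages = [{"role": "user", "content": "hi"}] * n_messages
    return env.from_string(TOY_TEMPLATE).render(messages=messages)


# Default Jinja2 settings: whitespace grows with the number of messages.
loose_1, loose_3 = render(1, strict=False), render(3, strict=False)
# Both flags enabled: the prefix is the same for any message count.
strict_1, strict_3 = render(1, strict=True), render(3, strict=True)

print(repr(loose_1))
print(repr(loose_3))
print(repr(strict_1), strict_1 == strict_3)
```

Under default settings each loop iteration emits the newline after the `for` tag plus the indentation before the `set` tag, so the rendered prefix depends on the conversation length. With both flags enabled the two renders are identical.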

Suggested Fix

Apply defensive programming: strip whitespace explicitly within the template so that the rendered output does not depend on the environment settings:

{# Suggested Change for Line 35 #}
{%- set ns.last_user_index = loop.index0 -%}

{# Suggested Change for Line 38 #}
{%- for m in messages -%}
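
A quick way to check a change like this is to render the template under both flag settings and several message counts, then confirm that the output is identical. The sketch below assumes the `jinja2` package is installed. `FIXED_TOY_TEMPLATE` is a toy imitation of the affected lines with explicit strippers added, not the real GLM-5 template.

```python
from jinja2 import Environment

# Toy template with explicit whitespace control on every block tag.
# NOT the real GLM-5 chat template.
FIXED_TOY_TEMPLATE = (
    "[gMASK]<sop>\n"
    "{%- set ns = namespace(last_user_index=-1) -%}\n"
    "{%- for m in messages -%}\n"
    "    {%- set ns.last_user_index = loop.index0 -%}\n"
    "{%- endfor -%}\n"
    "<|user|>"
)

outputs = []
for strict in (False, True):
    env = Environment(trim_blocks=strict, lstrip_blocks=strict)
    template = env.from_string(FIXED_TOY_TEMPLATE)
    for n_messages in (1, 3, 5):
        messages = [{"role": "user", "content": "hi"}] * n_messages
        outputs.append(template.render(messages=messages))

# One distinct rendering means the prefix is environment-agnostic.
distinct_outputs = set(outputs)
print(len(distinct_outputs), repr(outputs[0]))
```

The same check can be run against the real `chat_template.jinja` by loading its contents in place of the toy string and supplying the variables it expects.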

Steps to Reproduce

Render the template using a standard Jinja2 environment without trim_blocks=True or lstrip_blocks=True for different message lengths, and observe the varying number of tokens between <sop> and <|user|>.

ZHANGYUXUAN-zR

Z.ai org Mar 11

Thank you for your feedback. The issue has been reproduced, and the template will be updated.

ZHANGYUXUAN-zR changed discussion status to closed Mar 11
