Instructions for using zai-org/GLM-4.7 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use zai-org/GLM-4.7 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="zai-org/GLM-4.7")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7")
model = AutoModelForCausalLM.from_pretrained("zai-org/GLM-4.7")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use zai-org/GLM-4.7 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "zai-org/GLM-4.7"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.7",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/zai-org/GLM-4.7

SGLang

How to use zai-org/GLM-4.7 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "zai-org/GLM-4.7" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.7",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "zai-org/GLM-4.7" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.7",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use zai-org/GLM-4.7 with Docker Model Runner:
```
docker model run hf.co/zai-org/GLM-4.7
```

fix: `clear_thinking` inserts spurious `</think>` tag when reasoning_content is empty

#46

by beckyu - opened Mar 10

base: refs/heads/main

from: refs/pr/46

Problem

In multi-turn conversations, the current chat_template.jinja inserts a bare </think> tag into historical assistant messages even when reasoning_content is empty. This breaks inference frameworks (e.g., SGLang) that detect the </think> token to enforce a thinking budget.

Root Cause

The clear_thinking rendering block in the original template has an unconditional else branch that always inserts </think>:

{%- if ((clear_thinking is defined and not clear_thinking) or loop.index0 > ns.last_user_index) and reasoning_content -%}
{{ '<think>' + reasoning_content.strip() +  '</think>'}}
{%- else -%}
{{ '</think>' }}   {#- always inserts </think>, even when reasoning_content is empty -#}
{%- endif -%}

When reasoning_content is empty (which is common — users typically don't provide reasoning_content in multi-turn history), the else branch fires and unconditionally inserts </think>. This renders historical assistant messages as:

<|assistant|>
</think>
This is a previous answer
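
The branch above can be mirrored in plain Python to make the failure easy to reproduce without a template engine. This is only a sketch: the function name `render_reasoning_before` and the parameter `after_last_user` (standing in for `loop.index0 > ns.last_user_index`) are hypothetical, and `None` is assumed to model an undefined `clear_thinking`.

```python
from typing import Optional


def render_reasoning_before(
    reasoning_content: str,
    clear_thinking: Optional[bool],  # None models "clear_thinking is not defined"
    after_last_user: bool,           # models loop.index0 > ns.last_user_index
) -> str:
    """Plain-Python mirror of the original if/else branch (illustrative only)."""
    keep = (clear_thinking is not None and not clear_thinking) or after_last_user
    if keep and reasoning_content:
        return "<think>" + reasoning_content.strip() + "</think>"
    # Unconditional else branch: emitted even when there is nothing to close.
    return "</think>"


# Historical assistant turn for which the caller supplied no reasoning_content.
prefix = render_reasoning_before("", clear_thinking=False, after_last_user=False)
print("<|assistant|>\n" + prefix + "\nThis is a previous answer")
```

Running it prints the same three-line rendering shown above, with the bare closing tag on the second line.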

Impact

Inference frameworks like SGLang use a ThinkingBudgetLogitProcessor that scans the full input token sequence for </think> to determine whether the thinking phase has already ended:

# SGLang ThinkingBudgetLogitProcessor
if self.THINKING_END_TOKEN_ID in cur_ids:  # cur_ids = input + output
    continue  # skip budget enforcement

The spurious </think> tag from history tricks the processor into believing thinking has already concluded, causing it to skip budget enforcement entirely. As a result:

1st turn: thinking_budget works correctly.
2nd turn onwards: thinking_budget silently fails, and the model generates unbounded thinking content.
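
The effect on the two turns can be illustrated with a toy model of the membership check in the excerpt. This is not SGLang's implementation: the function `budget_enforced`, the constant `THINK_END_ID`, and all token ids below are made up for illustration.

```python
from typing import Sequence

THINK_END_ID = 2  # hypothetical token id standing in for </think>


def budget_enforced(input_ids: Sequence[int], output_ids: Sequence[int]) -> bool:
    """Return True when this toy check would still enforce the budget."""
    cur_ids = list(input_ids) + list(output_ids)  # input + output, as in the excerpt
    # Enforcement is skipped as soon as the end-of-thinking id appears anywhere.
    return THINK_END_ID not in cur_ids


first_turn_prompt = [10, 11, 12]              # no </think> in the prompt
second_turn_prompt = [10, 11, 2, 13, 14, 15]  # history carries a spurious </think>

print(budget_enforced(first_turn_prompt, []))   # → True
print(budget_enforced(second_turn_prompt, []))  # → False
```

Because the check looks at the whole sequence rather than only the current turn, a single closing tag anywhere in the history is enough to disable enforcement in this model.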

Fix

Replace the clear_thinking / reasoning rendering block with cleaner logic:

{#- clear_thinking=true means clear thinking content, no thinking tags at all -#}
{%- if clear_thinking is not defined or not clear_thinking -%}
    {#- clear_thinking=false or undefined: keep thinking content -#}
    {%- if reasoning_content -%}
{{ '<think>' + reasoning_content.strip() + '</think>' }}
    {%- endif -%}
{%- endif -%}
{#- clear_thinking=true: only output content, no thinking tags -#}

Behavior Matrix

| `clear_thinking` | `reasoning_content` | Before (bug) | After (fix) |
| --- | --- | --- | --- |
| `false` / undefined | non-empty | `<think>...</think>` | `<think>...</think>` |
| `false` / undefined | empty | `</think>` (spurious) | (nothing) |
| `true` | non-empty | `</think>` (spurious) | (nothing, cleared as intended) |
| `true` | empty | `</think>` (spurious) | (nothing) |

Key change: when reasoning_content is empty, no thinking-related tags are emitted at all, regardless of clear_thinking.
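
The "After (fix)" column can be checked against a plain-Python mirror of the proposed block. As a sketch under stated assumptions: `render_reasoning_after` is a hypothetical name, and `None` is assumed to model an undefined `clear_thinking`.

```python
from typing import Optional


def render_reasoning_after(reasoning_content: str, clear_thinking: Optional[bool]) -> str:
    """Plain-Python mirror of the proposed template block (illustrative only)."""
    if clear_thinking is None or not clear_thinking:
        # clear_thinking is false or undefined: keep thinking content if present.
        if reasoning_content:
            return "<think>" + reasoning_content.strip() + "</think>"
    # clear_thinking is true, or there is no reasoning: emit no tags at all.
    return ""


# Enumerate the rows of the behavior matrix.
for clear_thinking in (False, None, True):
    for reasoning in ("step 1", ""):
        rendered = render_reasoning_after(reasoning, clear_thinking)
        print(repr(clear_thinking), repr(reasoning), "->", repr(rendered))
```

Only the rows with a non-empty `reasoning_content` and a false or undefined `clear_thinking` produce any output; every other combination yields an empty string.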

Verification

| Scenario | Before fix | After fix |
| --- | --- | --- |
| Single-turn with `thinking_budget` | Works | Works |
| Multi-turn with `thinking_budget` | Fails (budget ignored from 2nd turn) | Works |
| `clear_thinking=true` with reasoning history | Inserts spurious `</think>` | Clean output, no tags |
| `clear_thinking=false` with reasoning history | Works | Works (no change) |
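
One way to spot the spurious tag when checking rendered prompts by hand is to count closing tags that have no opener before them. The helper below is a hypothetical string-level check, not part of the PR or of any framework, and it does not model how a framework tokenizes or interprets the prompt.

```python
import re


def unpaired_think_closers(prompt: str) -> int:
    """Count </think> tags that have no matching <think> opener before them."""
    depth = 0
    unpaired = 0
    for tag in re.findall(r"</?think>", prompt):
        if tag == "<think>":
            depth += 1
        elif depth > 0:
            depth -= 1      # closes a previously opened <think>
        else:
            unpaired += 1   # bare </think> with nothing to close
    return unpaired


buggy = "<|assistant|>\n</think>\nThis is a previous answer"
fixed = "<|assistant|>\nThis is a previous answer"
kept = "<|assistant|>\n<think>plan</think>\nThis is a previous answer"

print(unpaired_think_closers(buggy))  # → 1
print(unpaired_think_closers(fixed))  # → 0
print(unpaired_think_closers(kept))   # → 0
```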

Scope

This is a minimal, targeted fix: only the clear_thinking rendering block is changed.
All other template logic (tools, generation prompt, enable_thinking, etc.) remains untouched.
The same issue also affects zai-org/GLM-4.7

fix: `clear_thinking` inserts spurious `</think>` tag when reasoning_content is empty (c530cd99)
