fix: `clear_thinking` inserts spurious `</think>` tag when reasoning_content is empty
Problem
In multi-turn conversations, the current chat_template.jinja inserts a bare </think> tag into historical assistant messages even when reasoning_content is empty. This causes issues with inference frameworks (e.g., SGLang) that rely on </think> token detection to control thinking budget.
Root Cause
The clear_thinking rendering block in the original template has an unconditional else branch that always inserts </think>:
{%- if ((clear_thinking is defined and not clear_thinking) or loop.index0 > ns.last_user_index) and reasoning_content -%}
{{ '<think>' + reasoning_content.strip() + '</think>'}}
{%- else -%}
{{ '</think>' }} {# always inserts </think>, even when reasoning_content is empty #}
{%- endif -%}
When reasoning_content is empty (which is common, since users typically don't provide reasoning_content in multi-turn history), the else branch fires and unconditionally inserts </think>. This renders historical assistant messages as:
<|assistant|>
</think>
This is a previous answer
Impact
Inference frameworks like SGLang use a ThinkingBudgetLogitProcessor that scans the full input token sequence for </think> to determine whether the thinking phase has already ended:
# SGLang ThinkingBudgetLogitProcessor
if self.THINKING_END_TOKEN_ID in cur_ids:  # cur_ids = input + output
    continue  # skip budget enforcement
The spurious </think> tag from history tricks the processor into believing thinking has already concluded, causing it to skip budget enforcement entirely. As a result:
- 1st turn: `thinking_budget` works correctly
- 2nd turn onwards: `thinking_budget` silently fails, and the model generates unbounded thinking content
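The failure mode can be reproduced with a toy model of that check. This is a minimal sketch, not SGLang's actual implementation: the token IDs, the `budget_is_enforced` helper, and the flat-list prompts are all assumptions made for illustration.

```python
# Hypothetical token IDs, chosen only for illustration.
USER = 0
ASSISTANT = 1
THINK_END = 2  # stands in for the </think> token
TEXT = 9       # stands in for any ordinary text token


def budget_is_enforced(input_ids, output_ids):
    """Mimic the quoted check: if </think> appears anywhere in
    input + output, budget enforcement is skipped."""
    cur_ids = input_ids + output_ids
    return THINK_END not in cur_ids


# Turn 1: the prompt holds only the user message, so the budget applies.
turn1_prompt = [USER, TEXT]

# Turn 2, buggy template: the historical assistant message carries a bare
# </think>, so the check concludes that thinking has already ended.
turn2_buggy = [USER, TEXT, ASSISTANT, THINK_END, TEXT, USER, TEXT]

# Turn 2, fixed template: no thinking tags in the history.
turn2_fixed = [USER, TEXT, ASSISTANT, TEXT, USER, TEXT]

print(budget_is_enforced(turn1_prompt, []))  # True
print(budget_is_enforced(turn2_buggy, []))   # False
print(budget_is_enforced(turn2_fixed, []))   # True
```

In this model the budget is lost before a single new token is generated, which matches the symptom that the second turn fails silently.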
Fix
Replace the clear_thinking / reasoning rendering block with cleaner logic:
{#- clear_thinking=true means clear thinking content, no thinking tags at all -#}
{%- if clear_thinking is not defined or not clear_thinking -%}
{#- clear_thinking=false or undefined: keep thinking content -#}
{%- if reasoning_content -%}
{{ '<think>' + reasoning_content.strip() + '</think>' }}
{%- endif -%}
{%- endif -%}
{#- clear_thinking=true: only output content, no thinking tags -#}
Behavior Matrix
| clear_thinking | reasoning_content | Before (bug) | After (fix) |
|---|---|---|---|
| false / undefined | non-empty | `<think>...</think>` | `<think>...</think>` |
| false / undefined | empty | `</think>` (spurious) | (nothing) |
| true | non-empty | `</think>` (spurious) | (nothing, cleared as intended) |
| true | empty | `</think>` (spurious) | (nothing) |
Key change: when reasoning_content is empty, no thinking-related tags are emitted at all, regardless of clear_thinking.
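The two template branches can be modelled in plain Python to check the `false` and `true` rows for a historical assistant message. This is a sketch under stated assumptions: `render_before`, `render_after`, and the `after_last_user` flag (standing in for `loop.index0 > ns.last_user_index`) are hypothetical names, and `None` stands in for an undefined `clear_thinking`.

```python
def render_before(clear_thinking, reasoning_content, after_last_user=False):
    """Model of the original block. clear_thinking=None means undefined."""
    keep = (clear_thinking is not None and not clear_thinking) or after_last_user
    if keep and reasoning_content:
        return "<think>" + reasoning_content.strip() + "</think>"
    # Unconditional else branch: a bare closing tag is always emitted.
    return "</think>"


def render_after(clear_thinking, reasoning_content):
    """Model of the replacement block."""
    if clear_thinking is None or not clear_thinking:
        if reasoning_content:
            return "<think>" + reasoning_content.strip() + "</think>"
    # clear_thinking=true, or nothing to show: emit no thinking tags.
    return ""


# Print one line per combination for a historical message.
for clear_thinking in (False, True):
    for reasoning in ("some reasoning", ""):
        before = render_before(clear_thinking, reasoning)
        after = render_after(clear_thinking, reasoning)
        print(clear_thinking, repr(reasoning), repr(before), repr(after))
```

Per the quoted original condition, an undefined `clear_thinking` keeps reasoning only for messages after the last user turn, so the model exposes that case through `after_last_user` rather than folding it into the loop.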
Verification
| Scenario | Before fix | After fix |
|---|---|---|
| Single-turn with thinking_budget | Works | Works |
| Multi-turn with thinking_budget | Fails (budget ignored from 2nd turn) | Works |
| clear_thinking=true with reasoning history | Inserts spurious `</think>` | Clean output, no tags |
| clear_thinking=false with reasoning history | Works | Works (no change) |
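One way to spot-check a rendered prompt is to count closing tags that have no matching opener. The helper below is a hedged sketch: `count_bare_think_closers` is a hypothetical name, and the sample prompts (including the `<|user|>` marker) are hand-written assumptions rather than real template output.

```python
import re


def count_bare_think_closers(prompt: str) -> int:
    """Count </think> tags that are not preceded by an unclosed <think>."""
    open_tags = 0
    bare = 0
    for tag in re.findall(r"</?think>", prompt):
        if tag == "<think>":
            open_tags += 1
        elif open_tags > 0:
            open_tags -= 1  # this </think> closes an earlier <think>
        else:
            bare += 1       # closing tag with nothing to close
    return bare


# Hand-written prompts for illustration only.
buggy = "<|user|>\nHi<|assistant|>\n</think>\nThis is a previous answer<|user|>\nAgain"
fixed = "<|user|>\nHi<|assistant|>\nThis is a previous answer<|user|>\nAgain"
kept = "<|user|>\nHi<|assistant|>\n<think>step</think>\nThis is a previous answer"

print(count_bare_think_closers(buggy))  # 1
print(count_bare_think_closers(fixed))  # 0
print(count_bare_think_closers(kept))   # 0
```

A count of zero on a multi-turn prompt is the property the fixed template is expected to satisfy.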
Scope
- This is a minimal, targeted fix: only the `clear_thinking` rendering block is changed
- All other template logic (tools, generation prompt, `enable_thinking`, etc.) remains untouched
- The same issue also affects zai-org/GLM-4.7