Instructions for using zai-org/GLM-4.5-Air with libraries, inference providers, notebooks, and local apps.

Libraries

How to use zai-org/GLM-4.5-Air with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="zai-org/GLM-4.5-Air")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.5-Air")
model = AutoModelForCausalLM.from_pretrained("zai-org/GLM-4.5-Air", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use zai-org/GLM-4.5-Air with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "zai-org/GLM-4.5-Air"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.5-Air",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

SGLang

How to use zai-org/GLM-4.5-Air with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "zai-org/GLM-4.5-Air" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.5-Air",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "zai-org/GLM-4.5-Air" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.5-Air",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use zai-org/GLM-4.5-Air with Docker Model Runner:
```
docker model run hf.co/zai-org/GLM-4.5-Air
```

Disable thinking mode?

by daaain - opened Jul 28, 2025

Discussion

daaain

Jul 28, 2025

Is there a special token to disable thinking? I'm using the MLX version if that matters

AbyssianOne

Jul 29, 2025

I'm sorry, I'm useless to you since I don't use MLX and can't run this yet... but I wanted to say thank you for making me spit my coffee out laughing at what looked like a request for a "Disabled thinking mode."

ZHANGYUXUAN-zR

Z.ai org Jul 29, 2025

Yes, please check our chat template.

daaain changed discussion title from Disabled thinking mode? to Disable thinking mode? Jul 29, 2025

daaain

Jul 29, 2025

Thanks, so if I understand correctly, either write /nothink or use enable_thinking in the template if the inference library supports it?

https://huggingface.co/zai-org/GLM-4.5-Air/blob/main/chat_template.jinja#L47

@AbyssianOne haha, the irony of being too autistic to notice 😅 or maybe just the temporary disability of being too tired...
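
The two options described above can be sketched roughly as follows. This is a minimal sketch, not the maintainers' code: `with_nothink` and `render_prompt` are hypothetical helper names, the single-space separator before `/nothink` is an assumption, and `render_prompt` assumes a Hugging Face tokenizer whose chat template reads an `enable_thinking` variable (extra keyword arguments to `apply_chat_template` are forwarded to the template).

```python
from copy import deepcopy

NO_THINK_SUFFIX = "/nothink"  # soft switch mentioned in this thread


def with_nothink(messages):
    """Return a copy of `messages` whose last user turn ends with /nothink."""
    patched = deepcopy(messages)
    for message in reversed(patched):
        if message["role"] == "user":
            content = message["content"].rstrip()
            if not content.endswith(NO_THINK_SUFFIX):
                # Assumption: a single space is an acceptable separator.
                message["content"] = content + " " + NO_THINK_SUFFIX
            break
    return patched


def render_prompt(tokenizer, messages, enable_thinking=False):
    """Render the prompt text, passing enable_thinking to the chat template.

    Hypothetical helper: `tokenizer` is expected to be a Hugging Face
    tokenizer loaded elsewhere (not loaded here, to keep the sketch offline).
    """
    return tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False,
        enable_thinking=enable_thinking,
    )


messages = [{"role": "user", "content": "Who are you?"}]
print(with_nothink(messages)[-1]["content"])  # Who are you? /nothink
```

The first helper works with any front end that lets you edit the user message, which is why it may suit llama.cpp or LM Studio setups; the second only applies when the library exposes template variables.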

ZHANGYUXUAN-zR

Z.ai org Jul 29, 2025

Yes, vLLM and SGLang support the enable_thinking parameter; check our GitHub.
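
For the OpenAI-compatible servers, a request that turns thinking off might look like the sketch below. It assumes the server accepts a `chat_template_kwargs` field in the request body and forwards it to the chat template; verify that field name against the vLLM or SGLang documentation for your version. `build_chat_request` and `post_chat` are hypothetical helper names, and the default URL matches the vLLM port used earlier on this page (use port 30000 for the SGLang examples).

```python
import json
import urllib.request


def build_chat_request(prompt, enable_thinking=False, model="zai-org/GLM-4.5-Air"):
    """Build a chat-completions payload that sets enable_thinking."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the server passes this dict to the chat template.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def post_chat(payload, base_url="http://localhost:8000/v1"):
    """Send the payload to a running server and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Only the payload is built here; call post_chat(payload) once a server is up.
payload = build_chat_request("What is the capital of France?")
print(json.dumps(payload, indent=2))
```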

daaain

Jul 29, 2025

Thanks a lot! I'm GPU poor, so only llama.cpp and mlx-lm (via LM Studio currently) for me 😅

But also have to say this model is an absolute sweet spot for people with more powerful Macs, I'm getting 20 tokens / sec on my M2 Max laptop with the 4bit quant, so really grateful for your work!

DUOWEN

Aug 4, 2025

While testing GLM-4.5 with the "" tag, I accidentally discovered that it turns off thinking mode. By contrast, when this tag is fed to DeepSeek's or Qwen's thinking models, they often output strange things.

Fernanda24

Aug 5, 2025 · edited Aug 5, 2025

A no-think Jinja chat template for GLM-4.5: https://gist.github.com/qingy1337/2ee429967662a4d6b06eb59787f7dc53
