Instructions to use zai-org/GLM-4.5 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use zai-org/GLM-4.5 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="zai-org/GLM-4.5")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.5")
model = AutoModelForCausalLM.from_pretrained("zai-org/GLM-4.5")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
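GLM-4.5 is a very large MoE checkpoint, so a plain from_pretrained call on a single device will usually not fit in memory. A minimal sketch for sharding the weights across available GPUs (assumes accelerate is installed and sufficient total memory; torch_dtype and device_map are standard Transformers arguments, not GLM-specific settings):

# Shard the checkpoint across available GPUs (and CPU, if needed)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "zai-org/GLM-4.5",
    torch_dtype="auto",  # keep the checkpoint's native precision
    device_map="auto",   # requires the accelerate package
)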
- Inference
- HuggingChat
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use zai-org/GLM-4.5 with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "zai-org/GLM-4.5"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "zai-org/GLM-4.5",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
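Because the vLLM server exposes an OpenAI-compatible API, you can also call it from Python. A minimal sketch using the openai client (the api_key is a placeholder; the local server does not validate it):

# Call the local vLLM server through the OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="zai-org/GLM-4.5",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)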
Use Docker
docker model run hf.co/zai-org/GLM-4.5
- SGLang
How to use zai-org/GLM-4.5 with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "zai-org/GLM-4.5" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "zai-org/GLM-4.5",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
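The SGLang server is OpenAI-compatible as well, so the same client pattern works against port 30000. A minimal streaming sketch (again with a placeholder api_key):

# Stream tokens from the local SGLang server via its OpenAI-compatible API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
stream = client.chat.completions.create(
    model="zai-org/GLM-4.5",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()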
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "zai-org/GLM-4.5" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "zai-org/GLM-4.5",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
- Docker Model Runner
How to use zai-org/GLM-4.5 with Docker Model Runner:
docker model run hf.co/zai-org/GLM-4.5
AWQ 4-bit / GPTQ with full-precision gates and head? Please
I'm super impressed with the Air model. Unfortunately that's the only model that I can run at FP8.
looking for vllm or sglang supported quants please.
Working on the following:
AWQ 4-bit, FP16 gates & lm_head
– The model card explicitly lists skip_layers: ["lm_head", "router"], so the router logits and final head remain un-quantized.

GPTQ 4-bit, FP16 gates & lm_head
– quantize_config.json shows "true_sequential": true, "lm_head": false, "router": false, keeping those layers in FP16 (see the sketch after this list).

INT4 w4a16 version and INT8 w8a8 version with 2:4 sparsity (targeting CUDA compute capability 8.6).
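For reference, a sketch of what a quantize_config.json like the one described above could look like. Only the true_sequential, lm_head, and router flags come from the config quoted above; bits, group_size, desc_act, and sym are common GPTQ defaults assumed here for illustration:

import json

# Hypothetical quantize_config.json for the GPTQ 4-bit variant described above.
quantize_config = {
    "bits": 4,               # assumed: standard 4-bit GPTQ
    "group_size": 128,       # assumed: common GPTQ group size
    "desc_act": False,       # assumed default
    "sym": True,             # assumed default
    "true_sequential": True,
    "lm_head": False,        # keep the output head in FP16
    "router": False,         # keep the MoE router in FP16
}

with open("quantize_config.json", "w") as f:
    json.dump(quantize_config, f, indent=2)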
Will update once finished.
A GPTQ quant would be amazing. Need it for my 4 x V100 (Volta architecture).
Grinded all night to finish the quants off. Pushed it a little too hard; a kernel panic took me out around 5am and I haven't been able to resolve it before heading into work.
Have about 8 different variations done. My plan was to drop them all this morning, but tomorrow morning it is.
My expectations are highest for these three variations.
Model spec, GLM-4.5-Air (FP16 frame of reference):
Total Parameters: 106B
Active Parameters: 12B
Base FP16 Size: 218.25 GB
Expert Fraction: ~30%
Context Length: 128k
Critical components kept in FP16: router, gate_network, and lm_head layers
Quantization: int8 w8a8 for linear (model) weights | int8 w8a8 for expert weights
Sparsity: 2:4 (50%) for model weights | 2:4 (50%) for expert weights
(FP16 218 GB -> 73.12 GB.) In MoE models, 2:4 (50%) sparsity is nearly lossless with a small amount of post-training; without post-training you're looking at roughly 3% to 5% loss.
This also doesn't account for the speedups from only having to read half of the model weights.
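As a rough back-of-envelope check on that reduction (my own arithmetic under stated assumptions, not the exact packing of the released checkpoints): int8 halves the FP16 bytes, and compressed 2:4 sparsity then keeps 2 of every 4 values plus roughly 2 bits of index metadata per kept value, while the FP16 critical layers stay untouched.

# Back-of-envelope size estimate for int8 w8a8 + 2:4 sparsity.
# Real checkpoint sizes depend on the exact storage format and metadata.
fp16_total_gb = 218.25   # stated FP16 size of GLM-4.5-Air
fp16_kept_gb = 2.0       # assumed size of the FP16 router/gates/lm_head (rough guess)

quantizable_gb = fp16_total_gb - fp16_kept_gb
int8_gb = quantizable_gb / 2                # FP16 -> int8 halves the bytes
sparse_gb = int8_gb * 0.5 * (1 + 0.25)      # keep half the values + ~2-bit index per kept int8
estimate_gb = sparse_gb + fp16_kept_gb

print(f"estimated compressed size: ~{estimate_gb:.0f} GB")  # same ballpark as the 73.12 GB figure above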
Will run extensive testing post release.
my other two favorites after lunch :D
any luck?
This had to go on the back burner but I am planning to drop by the end of the weekend.
Would love to see sglang supported quants