We need Air and we need Flash

#3
by jacek2024 - opened

Please provide some Air or at least a Flash.

Of course, we are not in a position to make demands, so I would phrase the topic differently:

Dear Z.ai employees, we really like what you're doing and sincerely appreciate your efforts to democratize artificial intelligence and machine learning, especially the smaller models in your GLM lineup that are frequently SOTA in their class.

Therefore, we kindly ask you not to abandon this direction and to continue delighting us with wonderful consumer-grade open-weights releases, like the Flash and, especially, the Air models.

Alternatively, a native 4-bit quant release like Kimi-K2.5 or gpt-oss-120b would be nice. At FP8, this model no longer fits on 8xH100 or 4xH200 setups.
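To make the "no longer fits" claim concrete, here is a back-of-the-envelope sketch of the arithmetic. The 744B parameter count is a hypothetical placeholder (the thread doesn't state the model's size), and the 15% overhead for activations/KV cache is a rough assumption:

```python
# Hedged back-of-the-envelope: does a checkpoint fit in pooled GPU memory?
# 744B params below is a HYPOTHETICAL example size, not a published figure;
# the 15% overhead for activations/KV cache is likewise a rough assumption.

def fits_in_vram(params_b: float, bytes_per_param: float,
                 num_gpus: int, gpu_gb: float, overhead: float = 0.15) -> bool:
    """True if weights plus a rough runtime-overhead fraction fit in
    the aggregate VRAM of the given GPU setup."""
    weights_gb = params_b * bytes_per_param      # FP8 = 1 byte/param
    needed_gb = weights_gb * (1 + overhead)
    return needed_gb <= num_gpus * gpu_gb

# Hypothetical 744B-parameter model on 8x H100 (80 GB each):
print(fits_in_vram(744, 1.0, 8, 80))   # FP8: 855.6 GB needed > 640 GB pooled
print(fits_in_vram(744, 0.5, 8, 80))   # native 4-bit: 427.8 GB, fits
```

This is why a native 4-bit release (0.5 bytes/param) roughly halves the footprint and can bring a model back inside an 8xH100 node.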

I would really like a Flash model, or else my 32 GB of DDR4 will cry.

Joining the request :)

I think people are sleeping on Flash 4.7. You see "it's a bunch of 3B models glued together" and go "oh no". But after playing with it for a (long) while, it's been one of the best models in this (total) size range in... well, forever, actually. Hint, hint: fine-tune. As much as a Flash 5.0 would be nice, what's the point when literally nobody has toyed with 4.7 yet?

+1 for an Air model! GLM-4.5-Air not only had great performance for its size and active parameter count, but also quite good world knowledge compared to other MoE models of similar size (within my own testing, of course).

GLM-4.7-Flash is a nice model for its size, but there is a gap for a model in-between Flash and GLM-4.7/GLM-5, which GLM-4.5-Air filled nicely.

I am quite happy that GLM-5, such a strong model in many regards, is open weights, but an Air variant would be quite amazing for the local community! A model of ~100-150B parameters has the density for general world knowledge and a wide range of capabilities, while being small enough for single-GPU+CPU setups!

Even if 100B is enough for some people, my RX 5700 XT and 32 GB of RAM, for example, require smaller models. To be honest, I would like a Flash, but both would be great!

I second that!
GLM-5-Air at 212B?! :D

Would love it if it comes true :)
I only have access to 8 H100 GPUs, so something the size of GLM-4.7 for a GLM-5 Air would be awesome.

The value of GLM-4.5-Air was that it ran at acceptable speed, and at the highest possible quality, on 64 GB + 24 GB! And it outperformed gpt-oss-120b in many areas, not to mention everything with a smaller memory footprint... except maybe the Qwen 3 Next coder... As for the later GLM-4.6V, it's definitely not up to par; it's worse.

However, the 200B+ models (MiniMax M2.1, Step 3.5 Flash) are more relevant for me now, as I have more memory (128 GB + 24 GB).

All this runs on an old (four-year-old) local machine with an AMD Ryzen 5900X and an RTX 3090. Lots of people have this kind of hardware, especially with 64 GB of RAM. Owners of machines with DDR5 memory actually get a 2x speed boost. This is very impressive, considering that a year and a half ago only substandard 8B-parameter models without any practical use were available.
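The DDR5 "2x speed boost" follows from simple bandwidth arithmetic: with a CPU-offloaded MoE, each generated token must stream the active expert weights from RAM, so decode throughput is roughly memory bandwidth divided by active bytes per token. A minimal sketch, where the 12B active-parameter figure (GLM-4.5-Air) and the 50/100 GB/s dual-channel DDR4/DDR5 bandwidths are rough illustrative assumptions:

```python
# Rough decode-speed model for a CPU-offloaded MoE: throughput is roughly
# (RAM bandwidth) / (active bytes read per token). Bandwidth figures and the
# 12B active-parameter count are illustrative assumptions, not measurements.

def tokens_per_sec(active_params_b: float, bytes_per_param: float,
                   bandwidth_gb_s: float) -> float:
    active_gb = active_params_b * bytes_per_param  # bytes streamed per token
    return bandwidth_gb_s / active_gb

# ~12B active params at a 4-bit quant (~0.5 bytes/param):
ddr4 = tokens_per_sec(12, 0.5, 50)    # dual-channel DDR4, ~50 GB/s
ddr5 = tokens_per_sec(12, 0.5, 100)   # dual-channel DDR5, ~100 GB/s
print(round(ddr4, 1), round(ddr5, 1))  # → 8.3 16.7
```

Since active bytes per token are fixed, doubling the memory bandwidth doubles the estimated tokens per second, which matches the observed DDR4-to-DDR5 jump.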
