Instructions to use zai-org/GLM-4-9B-0414 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use zai-org/GLM-4-9B-0414 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="zai-org/GLM-4-9B-0414")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4-9B-0414")
model = AutoModelForCausalLM.from_pretrained("zai-org/GLM-4-9B-0414")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use zai-org/GLM-4-9B-0414 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "zai-org/GLM-4-9B-0414"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4-9B-0414",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
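
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on `http://localhost:8000`. The helper names `build_chat_request` and `ask` are hypothetical and not part of vLLM; the response is assumed to follow the usual OpenAI chat-completions shape.

```python
import json
import urllib.request


def build_chat_request(prompt, model="zai-org/GLM-4-9B-0414",
                       base_url="http://localhost:8000"):
    """Build (but do not send) a POST request mirroring the curl call above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body: choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires a running server; see the `vllm serve` command above.
    print(ask("What is the capital of France?"))
```

Passing `base_url="http://localhost:30000"` should point the same helper at a server listening on port 30000 instead.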

Use Docker

docker model run hf.co/zai-org/GLM-4-9B-0414

SGLang

How to use zai-org/GLM-4-9B-0414 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "zai-org/GLM-4-9B-0414" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4-9B-0414",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "zai-org/GLM-4-9B-0414" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4-9B-0414",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use zai-org/GLM-4-9B-0414 with Docker Model Runner:
```
docker model run hf.co/zai-org/GLM-4-9B-0414
```

I get too many repetitions

by JLouisBiz - opened Apr 15, 2025

Discussion

JLouisBiz

Apr 15, 2025

I'm using the quantized model Q8 with llama.cpp and I still get too many repetitions

MrDevolver

Apr 15, 2025

> I'm using the quantized model Q8 with llama.cpp and I still get too many repetitions

Unfortunately, this model currently doesn't work well with llama.cpp. 😢

JLouisBiz

Apr 15, 2025

Does it work well when used directly, without quantization?

ZHANGYUXUAN-zR

Z.ai org Apr 16, 2025

The reason we haven't released a quantized model is also that we encountered serious quality loss after quantization. We are looking into how to solve this. Currently, running the quantized model directly in llama.cpp results in serious performance loss, and the model cannot complete basic tasks.

MrDevolver

Apr 17, 2025

edited Apr 17, 2025

> The reason we haven't released a quantized model is also that we encountered serious quality loss after quantization. We are looking into how to solve this. Currently, running the quantized model directly in llama.cpp results in serious performance loss, and the model cannot complete basic tasks.

I fell in love with your models a long time ago. They are great models, but they are like forbidden fruit for me, because I cannot use them without proper GGUF support. 😢

If you could spare some time to assist those who are working on GGUF inference engines such as llama.cpp with implementing proper support for your models, please do so. I would appreciate it very much, and I'm sure many others would as well! ❤

I absolutely love the screenshots of the content your models can generate. They are lovely and stunning, full of extra detail I would not expect to get from such simple prompts! I'd also like to thank you for publishing the prompts that were used to generate them. With those prompts I was able to test various models on lmarena for comparison. My favorite is "Create a misty Jiangnan scene using SVG.", and I was very impressed by the output of your model:

It may be using simple shapes, but overall the image is beautiful and detailed. When I tested the same prompt with much bigger commercial models, they either failed completely or the generated images were not as detailed and pretty as the one generated by your model.

For example, this is from o3-mini, using the same prompt:

I think your model would be a real gem, a real star among the models for local inference, if only we could use it in llama.cpp, which is the only way I can personally run these models.

ZHANGYUXUAN-zR

Z.ai org Apr 17, 2025

We have received a large number of suggestions about quantization, and we are coordinating with staff to try to have them complete calibrated quantization within a certain period of time, especially for the 32B model. However, I still don't know how long it will take.

Goekdeniz-Guelmez

Apr 19, 2025

The reason is that the partial_rotary_factor is not used in RoPE in llama.cpp. You'd have to set it yourself: Issue with the resolve
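
To illustrate what a partial rotary factor does in general, here is a minimal pure-Python sketch. It is not GLM-4's or llama.cpp's actual implementation: the function name `apply_partial_rope`, the adjacent-pair rotation convention, the base of 10000, and the factor of 0.5 are all assumptions chosen for illustration. The idea is that only the first fraction of each head's dimensions is rotated by position, while the remaining dimensions pass through unchanged. An engine that ignores the factor and rotates every dimension would produce different query and key vectors than the model was trained with.

```python
import math


def apply_partial_rope(vec, position, partial_rotary_factor=0.5, base=10000.0):
    """Rotate only the leading part of one attention-head vector.

    vec: list of floats for a single head (illustrative input).
    partial_rotary_factor: fraction of the head dimensions that get RoPE.
    """
    head_dim = len(vec)
    rotary_dim = int(head_dim * partial_rotary_factor)
    rotary_dim -= rotary_dim % 2  # rotation acts on pairs, so keep it even

    out = list(vec)  # dimensions beyond rotary_dim are copied unchanged
    for i in range(0, rotary_dim, 2):
        # Standard RoPE frequency schedule over the rotated sub-space only.
        theta = position * base ** (-i / rotary_dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


if __name__ == "__main__":
    head = [1.0] * 8
    partial = apply_partial_rope(head, position=3, partial_rotary_factor=0.5)
    full = apply_partial_rope(head, position=3, partial_rotary_factor=1.0)
    print("partial:", partial)  # last 4 entries stay at 1.0
    print("full:   ", full)     # all entries are rotated
```

Comparing the two printed vectors shows why the mismatch matters: with a factor of 1.0 the trailing dimensions are rotated as well, so attention scores computed from them no longer match what the model expects.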
