Instructions to use zai-org/GLM-4.5 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use zai-org/GLM-4.5 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="zai-org/GLM-4.5")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.5")
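# GLM-4.5 is a very large model; device_map="auto" (requires the accelerate package)
# spreads the weights across the available devices instead of loading onto one.
model = AutoModelForCausalLM.from_pretrained(
    "zai-org/GLM-4.5",
    torch_dtype="auto",
    device_map="auto",
)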
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use zai-org/GLM-4.5 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "zai-org/GLM-4.5"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.5",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
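
The same request can also be made from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is listening on vLLM's default port 8000, and the helper names `build_chat_request` and `ask` are made up for illustration.

```python
import json
import urllib.request

# Assumption: the vLLM server started above is listening on its default port 8000.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt, model="zai-org/GLM-4.5", base_url=BASE_URL):
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` should print the model's reply. Passing a different `base_url` to `build_chat_request` points the same request at another OpenAI-compatible server, such as one listening on port 30000.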

Use Docker

docker model run hf.co/zai-org/GLM-4.5

SGLang

How to use zai-org/GLM-4.5 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "zai-org/GLM-4.5" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.5",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "zai-org/GLM-4.5" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.5",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use zai-org/GLM-4.5 with Docker Model Runner:
```
docker model run hf.co/zai-org/GLM-4.5
```

We Have Gemini At Home

by MarinaraSpaghetti - opened Jul 28, 2025

Discussion

MarinaraSpaghetti

Jul 28, 2025

All jokes aside, this model was blatantly trained on Gemini's outputs. It reads the same, makes the same mistakes, and has the same writing style. I had Gemini Flash set as a fallback model on OpenRouter, and I couldn't tell the difference between the reply from GLM and the previously mentioned one.

If you love locally run Gemini, this model is for you. Otherwise, don't bother, and go for the actual Gemini, since that one is smarter and handles long context better (this one is barely usable at 64k). Hybrid thinking is never a good idea, as we saw with Qwen3. Keep in mind, I tested the model in role-playing/creative writing scenarios. It might do better at coding.

To devs, don't mind my harsh review, I have very high expectations. The gooner crowd is very tough to please. Keep up the good work and cheers.

Oliver80

Jul 28, 2025

Lol, tru. GLM-4.5 performs better at coding and agentic capabilities. But given the API cost, what can I say? Not bad.

Derek-tel

Jul 29, 2025

tru,lol

ReMeDy-TV

Jul 29, 2025

edited Jul 30, 2025

Trying to use this for RP/writing on OpenRouter. It's available on NanoGPT, but still seems to be set to placeholder pricing (i.e., a $200.00 input rate, lol). (Edit: It's no longer on placeholder pricing and is now priced at $0.20 input.)

I do like how cheap it is. Even with a full 24k context it only costs $0.015 per inference. Gemini-2.5 is free tho, and free beats cheap, so I still give the nod to Gemini-2.5-Pro.
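
As a rough sanity check on those numbers, here is a small sketch of the prompt-side arithmetic. It assumes the $0.20 input rate is per million tokens (the post doesn't state the unit) and that a full 24k context means 24,000 input tokens; the function name is made up for illustration.

```python
def input_cost_usd(input_tokens, usd_per_million_tokens):
    """Cost of the prompt side of one request, in US dollars."""
    return input_tokens * usd_per_million_tokens / 1_000_000


# Assumptions: a full 24k context means 24,000 input tokens, and the
# $0.20 rate quoted above is per million input tokens.
prompt_cost = input_cost_usd(24_000, 0.20)
print(f"${prompt_cost:.4f}")  # prints $0.0048
```

Under those assumptions the prompt accounts for about $0.0048 of the quoted $0.015 per inference; the remainder would have to come from output tokens, whose rate isn't given in this thread.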

I noticed Chat Completion caused group chat characters to speak as each other in other characters' messages. Text Completion seems to fix this.
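
For readers unfamiliar with the distinction: a chat completion request sends a list of role-tagged messages and the server applies the model's chat template, while a text completion request sends one raw prompt string that the client formats itself. Below is a minimal sketch of the latter for a group chat. The character names and the `to_text_completion_prompt` helper are hypothetical, and this is not necessarily how any particular frontend builds its prompts.

```python
def to_text_completion_prompt(turns, next_speaker):
    """Flatten (speaker, text) turns into one prompt ending with the next speaker's name."""
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    lines.append(f"{next_speaker}:")
    return "\n".join(lines)


# Hypothetical group chat; the character names are made up for illustration.
turns = [
    ("Alice", "Did anyone feed the cat?"),
    ("Bob", "I thought it was your turn."),
]
prompt = to_text_completion_prompt(turns, next_speaker="Carol")

# Body for an OpenAI-compatible /v1/completions route (text completion).
# The stop strings cut generation off if the model starts another character's line.
payload = {
    "model": "zai-org/GLM-4.5",
    "prompt": prompt,
    "stop": ["\nAlice:", "\nBob:"],
}
```

Because the client controls the whole prompt here, each line carries an explicit speaker name and the prompt ends on the character who should speak next.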

It's also slower than Gemini-2.5-Pro, and I know it's unfair to compare GLM-4.5 via OpenRouter to Google's main servers, but it is what it is. It's also possible OpenRouter is getting slammed as I randomly get "too many request" error messages, so hopefully the situation improves.

I'm going to hold off reviewing the writing until I see more as OpenRouter is just way too slow or unresponsive with it atm.
