Instructions to use zai-org/GLM-4.5 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use zai-org/GLM-4.5 with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("text-generation", model="zai-org/GLM-4.5")
    messages = [
        {"role": "user", "content": "Who are you?"},
    ]
    pipe(messages)

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.5")
    model = AutoModelForCausalLM.from_pretrained("zai-org/GLM-4.5")
    messages = [
        {"role": "user", "content": "Who are you?"},
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Inference
- HuggingChat
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use zai-org/GLM-4.5 with vLLM:
Install from pip and serve model
    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "zai-org/GLM-4.5"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "zai-org/GLM-4.5",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
Use Docker
docker model run hf.co/zai-org/GLM-4.5
- SGLang
How to use zai-org/GLM-4.5 with SGLang:
Install from pip and serve model
    # Install SGLang from pip:
    pip install sglang

    # Start the SGLang server:
    python3 -m sglang.launch_server \
        --model-path "zai-org/GLM-4.5" \
        --host 0.0.0.0 \
        --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "zai-org/GLM-4.5",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
Use Docker images
    docker run --gpus all \
        --shm-size 32g \
        -p 30000:30000 \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=<secret>" \
        --ipc=host \
        lmsysorg/sglang:latest \
        python3 -m sglang.launch_server \
            --model-path "zai-org/GLM-4.5" \
            --host 0.0.0.0 \
            --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "zai-org/GLM-4.5",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
- Docker Model Runner
How to use zai-org/GLM-4.5 with Docker Model Runner:
docker model run hf.co/zai-org/GLM-4.5
We Have Gemini At Home
All jokes aside, this model was blatantly trained on Gemini's outputs. It reads the same, makes the same mistakes, and has the same writing style. I had Gemini Flash set as a fallback model on OpenRouter, and I couldn't tell GLM's replies apart from the ones that came from Gemini Flash.
If you'd love a locally run Gemini, this model is for you. Otherwise, don't bother and go for the actual Gemini, since that one is smarter and handles long context better (this one is barely usable at 64k). Hybrid thinking is never a good idea, as we've seen with Qwen3. Keep in mind that I only tested the model in role-playing/creative writing scenarios. It might do better at coding.
To devs, don't mind my harsh review, I have very high expectations. The gooner crowd is very tough to please. Keep up the good work and cheers.
Lol, tru. GLM-4.5 performs better at coding and agentic tasks. But at that API price, what can I say? Not bad.
tru,lol
Trying to use this for RP/writing on OpenRouter. It's also available on NanoGPT, but seems to still be set to placeholder pricing (i.e., $200.00 input rate, lol). (Edit: It's no longer on placeholder pricing and is now priced at $0.20 input).
I do like how cheap it is. Even with a filled 24k ctx it only costs $0.015 per inference. Gemini-2.5 is free tho, and free beats cheap, so I still give the nod to Gemini-2.5-Pro.
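For a rough sanity check on those numbers, the input side of one request can be estimated from the prompt size and the input rate. This is only a sketch: it assumes the $0.20 figure is a price per million input tokens (the post doesn't state the unit), it ignores output tokens because no output price is given, and `input_cost_usd` is a made-up helper name.

```python
def input_cost_usd(prompt_tokens: int, usd_per_million_input: float) -> float:
    """Estimate the input-side cost of a single request in USD."""
    return prompt_tokens * usd_per_million_input / 1_000_000


# Assumption: $0.20 is per million input tokens, and the 24k context is full.
cost = input_cost_usd(24_000, 0.20)
print(f"${cost:.4f} per request (input tokens only)")
```

Under that assumption the prompt alone comes to about half a cent, so the rest of the $0.015 per inference would have to come from things this sketch doesn't model, such as output tokens.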
I noticed Chat Completion caused group chat chars to speak as each other in other char messages. Text Completion seems to fix this.
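One plausible reason Text Completion helps here: with a flat prompt you decide whose name tag the prompt ends on, and you can stop generation as soon as another character's tag shows up. A minimal sketch of that idea, assuming an OpenAI-compatible text completion endpoint that accepts `prompt` and `stop` fields. The character names and the `build_text_completion_payload` helper are invented for illustration, and the code only builds the request body without sending anything.

```python
import json


def build_text_completion_payload(model, history, next_speaker, all_speakers,
                                  max_tokens=300):
    """Flatten a group chat into one prompt that ends on next_speaker's name tag."""
    lines = [f"{speaker}: {text}" for speaker, text in history]
    # End the prompt with the tag of the character who should speak next.
    prompt = "\n".join(lines) + f"\n{next_speaker}:"
    # Stop as soon as the model starts a line for any other character.
    stop = [f"\n{name}:" for name in all_speakers if name != next_speaker]
    return {"model": model, "prompt": prompt, "stop": stop,
            "max_tokens": max_tokens}


# Hypothetical three-character group chat.
history = [
    ("Alice", "Did anyone see where Bob went?"),
    ("Bob", "Right behind you."),
]
payload = build_text_completion_payload(
    "zai-org/GLM-4.5", history, "Carol", ["Alice", "Bob", "Carol"]
)
print(json.dumps(payload, indent=2))
```

Some providers cap the number of stop sequences, so a large cast may need a shorter list, for example only the characters who spoke recently.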
It's also slower than Gemini-2.5-Pro, and I know it's unfair to compare GLM-4.5 via OpenRouter to Google's main servers, but it is what it is. It's also possible OpenRouter is getting slammed as I randomly get "too many request" error messages, so hopefully the situation improves.
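If the random "too many request" errors are rate limiting, the usual client-side workaround is to retry with exponential backoff. The sketch below is generic Python and not tied to any OpenRouter SDK: `RateLimitError` is a hypothetical stand-in for however your client surfaces an HTTP 429, and `send` is whatever function actually performs the request.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 'too many requests' response."""


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send(); on RateLimitError wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


# Usage with a fake sender that is rate limited twice before succeeding.
attempts = {"n": 0}


def fake_send():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError("too many requests")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(fake_send, sleep=waits.append)
print(result, len(waits))
```

The jitter spreads retries out so that many clients hitting the same limit don't all retry at the same moment.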
I'm going to hold off reviewing the writing until I see more as OpenRouter is just way too slow or unresponsive with it atm.
