Why is the API for GLM-5.1 more expensive than GLM-5 when the model size is the same?
Hi team and community,
I noticed that the API pricing for GLM-5.1 is higher than GLM-5 on the Z.ai platform:
GLM-5.1: Input $1.4 / Output $4.4
GLM-5: Input $1 / Output $3.2
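For reference, the relative increase implied by the two price lines above can be worked out directly. This is a minimal sketch that uses only the figures quoted in this post; the billing unit behind each dollar amount is not stated here, and it cancels in the ratio anyway.

```python
# Prices exactly as quoted in the post; the billing unit is not stated there.
prices = {
    "GLM-5.1": {"input": 1.4, "output": 4.4},
    "GLM-5": {"input": 1.0, "output": 3.2},
}


def price_ratio(kind: str) -> float:
    """Return the GLM-5.1 price divided by the GLM-5 price for `kind`."""
    return prices["GLM-5.1"][kind] / prices["GLM-5"][kind]


for kind in ("input", "output"):
    increase = (price_ratio(kind) - 1) * 100
    print(f"{kind}: {price_ratio(kind):.3f}x ({increase:.1f}% higher)")
```

With the quoted figures this comes to roughly 40% more for input and 37.5% more for output.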
As far as I know, both models share the same architecture and parameter size (744B total, 40B active MoE).
So my question is: Why the price increase?
Is the inference efficiency worse due to defaults like Thinking Mode or agentic optimizations? Or is it purely a business decision (value-based pricing) because GLM-5.1 is highly optimized and much smarter via post-training?
What puzzles me most is that since GLM-5 and GLM-5.1 share the same architecture and parameter size, the inference cost (hardware requirement) should be identical. In an open-source ecosystem, anyone hosting the model would simply replace 5 with 5.1 at zero additional operational cost.
Therefore, choosing 5 over 5.1 just because it's 'cheaper' seems fundamentally irrational from a purely technical standpoint. Is this API pricing strictly a business strategy (value-based pricing to recover R&D costs), or is there an invisible technical overhead in 5.1 that I'm missing?
I'd love to hear the technical or strategic reasons behind this. Thanks!
Yes, I also wonder why GLM-5, which shares its core technology (DSA) with DeepSeek-V3.2 and is of comparable size (744B-A40B vs 671B-A37B), costs several times as much as the latter. It might be a purely commercial consideration. (Note that almost all providers on OpenRouter match their prices to the official one.)
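To put "comparable size" in numbers, here is a small sketch using only the parameter counts quoted above, reading "744B-A40B" as 744B total and 40B active parameters (and likewise for "671B-A37B"). It says nothing about actual serving cost, only about the size gap.

```python
# Sizes as quoted above, in billions of parameters.
models = {
    "GLM-5": {"total": 744, "active": 40},
    "DeepSeek-V3.2": {"total": 671, "active": 37},
}


def size_ratio(field: str) -> float:
    """Return the GLM-5 value divided by the DeepSeek-V3.2 value for `field`."""
    return models["GLM-5"][field] / models["DeepSeek-V3.2"][field]


for field in ("total", "active"):
    print(f"{field} parameters: {size_ratio(field):.2f}x")
```

Under these figures GLM-5 is about 11% larger in total parameters and about 8% larger in active parameters, so the size difference alone is far from "several times".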
I suspect (not sure) there might be two reasons for this:
1) Chinese compute is much cheaper (thanks to abundant energy and subsidies), even though American chips are better. So American servers (like OpenRouter) are easily undercut by Chinese compute.
2) Data war: using point 1 as leverage, Chinese companies are aggressively selling their own API/OpenClaw services (even at a loss). That's one of the reasons some Chinese models are going proprietary (like the glm-turbo series). So if you don't want to pay a premium, grab their coding plan 🤓.
What puzzles me most is that since GLM-5 and GLM-5.1 share the same architecture and parameter size, the inference cost (hardware requirement) should be identical.
This assumption is not correct, which might be where the confusion comes from. Here is a short explanation from ChatGPT:
Since GLM-5 and GLM-5.1 appear to be very similar MoE models with roughly the same active parameter count, their baseline per-token compute and minimum weight-memory requirements should be broadly similar. But their real inference cost is not guaranteed to be identical, because serving cost also depends on exact parameter count, routing behavior, attention implementation, context/output lengths, quantization, batching, cache behavior, inference framework, and any “thinking”/agentic usage patterns.
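One of the factors listed above, output length, is easy to illustrate from the billing side. The sketch below is a rough illustration, not a model of anyone's real serving cost. The prices are the ones quoted at the top of the thread; treating them as USD per one million tokens is an assumption, and the token counts are made-up numbers, not measurements of either model.

```python
# Assumption: the prices quoted in the thread are USD per one million tokens.
PRICES = {
    "GLM-5": {"input": 1.0, "output": 3.2},
    "GLM-5.1": {"input": 1.4, "output": 4.4},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request under the per-million-token assumption."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


# Hypothetical workload: the same prompt, but the second call emits far more
# tokens (for example, a longer "thinking" trace). These counts are invented.
short_cost = request_cost("GLM-5.1", input_tokens=2_000, output_tokens=500)
long_cost = request_cost("GLM-5.1", input_tokens=2_000, output_tokens=4_000)

print(f"short answer: ${short_cost:.4f}")
print(f"long answer:  ${long_cost:.4f}")
print(f"ratio: {long_cost / short_cost:.2f}x")
```

In this hypothetical case the longer response costs about four times as much at the same per-token price, which is why identical architecture does not by itself imply identical cost per request.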