I get too many repetitions
I'm using the quantized model Q8 with llama.cpp and I still get too many repetitions
Unfortunately, this model currently doesn't work well with llama.cpp. 😢
Does it work well without quantization, when run directly?
The reason we haven't released a quantized model is that we ran into serious quality loss after quantization. We are looking into how to solve this. Currently, using the quantized model directly in llama.cpp causes a serious drop in performance, and the model cannot complete basic tasks.
I fell in love with your models a long time ago. They are great models, but they are like forbidden fruit for me, because I cannot use them without proper GGUF support. 😢
If you could spare some time to help those working on GGUF inference engines such as llama.cpp implement proper support for your models, please do so. I would appreciate it very much, and I'm sure many others would as well! ❤
I absolutely love your screenshots of the content your models can generate. They are lovely and stunning, full of extra detail I would not expect to get from such simple prompts! I'd also like to thank you for publishing the prompts that were used to generate them. With those prompts I was able to test various other models on lmarena for comparison. My favorite is "Create a misty Jiangnan scene using SVG.", and I was very impressed by the output of your model:
It may be using simple shapes, but overall the image is beautiful and detailed. When I tested the same prompt with much bigger commercial models, they either failed completely or the generated images were not as detailed and pretty as the one generated by your model.
For example, this is from o3-mini, using the same prompt:
I think your model would be a real gem, a real star among the models for local inference, if only we could use it in llama.cpp, which is the only way I can personally run these models.
We have received a large number of requests about quantization, and we are coordinating with staff to try to have them complete a calibrated quantization within a certain period of time, especially for the 32B model. However, I still don't know how long it will take.
The reason is that partial_rotary_factor is not used in RoPE in llama.cpp. You'd have to set it yourself: Issue with the resolve
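For anyone wondering what that setting changes, here is a rough sketch of partial RoPE in plain Python. It is not the llama.cpp or Transformers code. The function name `apply_partial_rope`, the value 0.5, the base of 10000 and the adjacent-pair layout are all assumptions for illustration, so check the model's config.json for the real values. The idea is that only the first `partial_rotary_factor` share of each head vector gets rotated and the rest passes through unchanged.

```python
import math

def apply_partial_rope(vec, position, partial_rotary_factor=0.5, base=10000.0):
    """Rotate only the leading share of one attention-head vector (illustrative)."""
    head_dim = len(vec)
    rotary_dim = int(head_dim * partial_rotary_factor)
    rotary_dim -= rotary_dim % 2  # RoPE rotates dimensions in pairs

    out = list(vec)  # dimensions past rotary_dim are copied through untouched
    for i in range(0, rotary_dim, 2):
        # The frequency is scaled by rotary_dim, not by the full head_dim.
        theta = position * base ** (-i / rotary_dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * cos_t - y * sin_t
        out[i + 1] = x * sin_t + y * cos_t
    return out

q = [1.0] * 8
partial = apply_partial_rope(q, position=3, partial_rotary_factor=0.5)
full = apply_partial_rope(q, position=3, partial_rotary_factor=1.0)
print(partial[4:] == q[4:])  # second half left alone with factor 0.5
print(full[4:] == q[4:])     # second half rotated when the factor is ignored
```

If an engine ignores the factor and rotates the whole head, the second half of every query and key vector ends up different from what the model was trained with, which would fit the kind of degraded output described in this thread.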