Inference much slower compared to other A3B models
I have tested the model's speed in llama.cpp on my hardware, and it turns out to be much slower than other A3B models. Both pp and tg also drop much faster with context depth than in other similar-size MoE models. I want to check whether you have any benchmarks from your internal testing showing how the model fares against other similar-size models for prompt processing and token generation.
I want to understand whether this is a llama.cpp problem, a Vulkan back-end problem, or inherent to the model's internal architecture.
`llama-bench` build: 8f91ca54e (7822)

**FA = off** (flash attention disabled)
| Test | gpt-oss 20B MXFP4 MoE | nemotron_h_moe 31B.A3.5B Q8_0 | qwen3moe 30B.A3B Q8_0 | GLM4.7 Flash Q8_0 |
|---|---|---|---|---|
| pp512 | 147.08 ± 1.49 | 113.02 ± 0.32 | 119.46 ± 0.18 | 102.81 ± 0.31 |
| tg128 | 16.17 ± 0.00 | 12.05 ± 0.01 | 12.95 ± 0.01 | 10.77 ± 0.00 |
| pp512 @ d1024 | 136.19 ± 1.73 | 111.24 ± 0.13 | 105.93 ± 0.34 | 86.65 ± 0.31 |
| tg128 @ d1024 | 15.78 ± 0.03 | 11.84 ± 0.01 | 12.06 ± 0.06 | 7.29 ± 0.05 |
| pp512 @ d2048 | 128.45 ± 1.21 | 108.86 ± 0.40 | 94.63 ± 0.51 | 73.20 ± 0.48 |
| tg128 @ d2048 | 15.20 ± 0.03 | 11.50 ± 0.00 | 11.23 ± 0.00 | 5.28 ± 0.03 |
| pp512 @ d8096 | 95.64 ± 0.76 | 98.47 ± 0.93 | 56.28 ± 0.18 | 38.71 ± 0.02 |
| tg128 @ d8096 | 12.28 ± 0.01 | 9.17 ± 0.05 | 5.89 ± 0.05 | 2.19 ± 0.02 |
**FA = on** (flash attention enabled)
| Test | gpt-oss 20B MXFP4 MoE | nemotron_h_moe 31B.A3.5B Q8_0 | qwen3moe 30B.A3B Q8_0 | GLM4.7 Flash Q8_0 |
|---|---|---|---|---|
| pp512 | 146.69 ± 0.87 | 112.54 ± 0.65 | 114.26 ± 0.87 | 86.09 ± 0.12 |
| tg128 | 16.64 ± 0.01 | 12.12 ± 0.01 | 13.39 ± 0.01 | 10.97 ± 0.01 |
| pp512 @ d1024 | 132.76 ± 0.39 | 107.09 ± 0.32 | 77.43 ± 0.10 | 50.39 ± 0.10 |
| tg128 @ d1024 | 16.36 ± 0.08 | 12.05 ± 0.01 | 12.29 ± 0.00 | 9.76 ± 0.01 |
| pp512 @ d2048 | 120.38 ± 0.10 | 101.26 ± 0.28 | 55.47 ± 0.35 | 35.40 ± 0.02 |
| tg128 @ d2048 | 16.11 ± 0.08 | 11.98 ± 0.00 | 11.66 ± 0.01 | 8.79 ± 0.00 |
| pp512 @ d8096 | 77.32 ± 0.34 | 77.85 ± 0.48 | 20.76 ± 0.17 | 12.94 ± 0.01 |
| tg128 @ d8096 | 14.91 ± 0.01 | 11.52 ± 0.00 | 8.92 ± 0.00 | 5.58 ± 0.00 |
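To make the "drops much faster" claim concrete, here is a quick sketch that computes what fraction of its base tg128 throughput each model retains at d8096 (FA = on, mean values copied from the table above):

```python
# How well does each model retain tg128 throughput at depth?
# Pairs are (tg128 at d0, tg128 at d8096), FA = on, means from the table above.
tg128 = {
    "gpt-oss 20B MXFP4":        (16.64, 14.91),
    "nemotron_h_moe 31B.A3.5B": (12.12, 11.52),
    "qwen3moe 30B.A3B":         (13.39, 8.92),
    "GLM4.7 Flash":             (10.97, 5.58),
}

# Percentage of base throughput retained at d8096, rounded to one decimal.
retention = {name: round(100 * deep / base, 1)
             for name, (base, deep) in tg128.items()}

for name, pct in sorted(retention.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {pct}% of base tg128 retained at d8096")
# nemotron retains ~95%, gpt-oss ~90%, qwen3moe ~67%, GLM4.7 Flash ~51%
```

This is what the last comment below is pointing at: the Nemotron model degrades far less with depth than the pure-attention MoEs.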
I suppose you should compare the speed with vLLM / SGLang, since the llama.cpp support was not added by z.ai.
I do not have a machine to do that. That is why I am asking for results from their internal testing; that would give a reference point for whether the slowdown comes from the inference engine or from the model itself. The Qwen team, for example, published benchmarks for Qwen3-Next against other similar-size models.
So far GPT-OSS 20B is king at lower contexts, and NVIDIA-Nemotron-3-Nano-30B-A3B is best at retaining pp and tg with depth, thanks to its Mamba-2 architecture.