Enormous KV-cache size?

#3
by nephepritou - opened

Just want to verify I'm configuring everything correctly and that it really is just a huge KV-cache size, not a mistake in my vLLM configuration.

python3 -m vllm.entrypoints.openai.api_server \
    --model /mnt/data/llm-data/models/zai-org/GLM-4.7-Flash \
    --served-model-name glm-4.7-flash \
    --tensor-parallel-size 4 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.92 \
    --max-num-seqs 2 \
    --enable-auto-tool-choice \
    --tool-call-parser glm47 \
    --reasoning-parser glm45

And I got an error:

To serve at least one request with the models's max seq len (131072), (29.38 GiB KV cache is needed, which is larger than the available KV cache memory (7.29 GiB). Based on the available memory, the estimated maximum model length is 32528. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.

It was a surprise because I thought it would consume not much more than Qwen3 Coder 30B, and for Qwen I fit 280K tokens into the same 4x RTX 3090.
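
A quick sanity check on the reported figure (assuming, as a reply below confirms, that the 29.38 GiB in the error is the per-GPU requirement under tensor parallelism):

python3 -c "gib=29.38; tok=131072; tp=4; \
mib=gib*1024/tok; \
print(f'{mib:.3f} MiB/token/GPU, {mib*tp:.3f} MiB/token total')"
# prints: 0.230 MiB/token/GPU, 0.918 MiB/token total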

Try changing --max-num-seqs to 1 and increasing --gpu-memory-utilization.
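
For example, an untested sketch of the same launch with those two flags adjusted (if it still doesn't fit, the error's own suggestion of a lower --max-model-len, e.g. 32768, is the other lever):

python3 -m vllm.entrypoints.openai.api_server \
    --model /mnt/data/llm-data/models/zai-org/GLM-4.7-Flash \
    --served-model-name glm-4.7-flash \
    --tensor-parallel-size 4 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 1 \
    --enable-auto-tool-choice \
    --tool-call-parser glm47 \
    --reasoning-parser glm45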

I'm on a single RTX 6000 Pro; for the same 131072 it's asking for 120 GiB of KV! How is your config only requesting 29.38 GiB for 131k tokens?
Here's what I'm getting:
ValueError: To serve at least one request with the models's max seq len (131072), (120.0 GiB KV cache is needed, which is larger than the available KV cache memory (31.83 GiB). Based on the available memory, the estimated maximum model length is 34768. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.

My config:

MODEL_ID="zai-org/GLM-4.7-Flash"

vllm serve "${MODEL_ID}" \
  --host 0.0.0.0 \
  --port 1236 \
  --gpu-memory-utilization 0.96 \
  --max-model-len 131072 \
  --max-num-seqs 32 \
  --tensor-parallel-size 1 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash

Because it requested ~30 GiB on EACH card, so ~120 GiB total: 4 × 29.38 GiB ≈ 117.5 GiB, which matches the ~120 GiB you see on a single GPU.

@nepherpritou
Guessing you're on 4x 3090?
There are FP8 and NVFP4 versions out that can leave more room for context, but it's still ~0.91 MB of KV cache per token unless you want to quantize the KV cache.
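
If you do want to try quantizing the KV cache, vLLM exposes a --kv-cache-dtype flag; a minimal sketch (fp8 roughly halves the per-token footprint, though backend support and accuracy impact vary by model):

vllm serve zai-org/GLM-4.7-Flash \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92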

It seems like even on mainline llama.cpp this model takes up a lot of space in the compute buffer, e.g. 68k context uses almost 24 GB for the compute buffer alone, huh...

Test GGUF quant with more details, including initial benchmarks (before I OOM'd): https://huggingface.co/ubergarm/GLM-4.7-Flash-GGUF

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 69632 \
  -fit off \
  -fa off \
  -ngl 99 \
  -ub 4096 -b 4096 \
  --threads 1

llama_context: n_ctx_seq (69632) < n_ctx_train (202752) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.59 MiB
llama_kv_cache:      CUDA0 KV buffer size =  6791.50 MiB
llama_kv_cache: size = 6791.50 MiB ( 69632 cells,  47 layers,  1/1 seqs), K (f16): 3595.50 MiB, V (f16): 3196.00 MiB
sched_reserve: reserving ...
sched_reserve:      CUDA0 compute buffer size = 23431.06 MiB
sched_reserve:  CUDA_Host compute buffer size =  1136.08 MiB
sched_reserve: graph nodes  = 3504
sched_reserve: graph splits = 2
sched_reserve: reserve took 481.45 ms, sched copies = 1
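
For what it's worth, the compute buffer scales roughly with the ubatch size, so dropping -ub should shrink that ~23 GiB buffer at some cost to prompt-processing speed. An untested variant of the same run:

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 69632 \
  -fit off \
  -fa off \
  -ngl 99 \
  -ub 512 -b 2048 \
  --threads 1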

Have you vLLM folks also noticed token generation speed (TG, aka decode, aka TPOT) dropping quickly as the context/KV cache grows? It might be an implementation issue, still looking: https://github.com/ggml-org/llama.cpp/issues/18944

It seems the model should be using MLA, but isn't, and is thus falling back to MHA. If MLA were working properly, the KV footprint would be ~54 KB per token instead of ~0.91 MB per token.
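
Putting the two per-token figures side by side at the full 131072 context (quick arithmetic, nothing more):

python3 -c "tok=131072; \
print(f'MLA: {tok*54/1024/1024:.2f} GiB, MHA: {tok*0.91/1024:.1f} GiB')"
# prints: MLA: 6.75 GiB, MHA: 116.5 GiB (lines up with the 120 GiB in the error)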
