Why does the KV cache occupy so much GPU memory?
My GPUs are 2 × 48 GB, and the maximum number of tokens I can successfully launch with is only slightly over 20K.
vllm serve /home/tester/.cache/huggingface/hub/models--zai-org--GLM-4.7-Flash/snapshots/279ecdf8ee35f17f1939f95d6b113d8b806a7b2b \
--tensor-parallel-size 2 \
--swap-space 4 \
--max-model-len 27600 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm_47_flash \
--port 7777 \
--api-key 12345678 \
--gpu-memory-utilization 0.92
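For context, here is a rough back-of-the-envelope sketch of why the cache gets so large without MLA: standard attention stores a K and a V vector per layer per token, while MLA stores a single compressed latent (plus the RoPE part) per layer, shared across heads. The layer, head, and rank numbers below are placeholder assumptions, not the real GLM-4.7-Flash config, so substitute the values from the model's config.json:

```python
# Rough KV-cache sizing sketch. All model dimensions below are placeholders --
# read the real values from the model's config.json before trusting the numbers.

def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Standard (non-MLA) attention caches one K and one V vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def mla_cache_bytes_per_token(num_layers, kv_lora_rank, rope_head_dim, dtype_bytes=2):
    # MLA caches one compressed latent plus the RoPE portion per layer,
    # shared across heads, instead of full K/V tensors.
    return num_layers * (kv_lora_rank + rope_head_dim) * dtype_bytes

# Hypothetical config values, NOT the real GLM-4.7-Flash numbers:
layers, kv_heads, head_dim = 48, 32, 128
per_token = kv_cache_bytes_per_token(layers, kv_heads, head_dim)
print(f"full KV: ~{per_token / 1024:.1f} KiB per token, "
      f"~{per_token * 20_000 / 2**30:.1f} GiB for 20K tokens")

# DeepSeek-style MLA placeholders (kv_lora_rank=512, rope dim=64):
mla_per_token = mla_cache_bytes_per_token(layers, 512, 64)
print(f"MLA:     ~{mla_per_token / 1024:.1f} KiB per token")
```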
vLLM doesn't use MLA. Try SGLang from the model readme; it will allow ~120K tokens. But the generation speed will not make you happy :(
Apparently:

> vLLM doesn't use MLA. Try SGLang from the model readme; it will allow ~120K tokens. But the generation speed will not make you happy :(

Is SGLang that bad compared to vLLM? o__O
Some discussion on it here and links to both performance and quantization quality benchmarks: https://huggingface.co/zai-org/GLM-4.7-Flash/discussions/3#696ff1a8a2f6682f043decc3
ik_llama.cpp supports MLA with GLM-4.7-Flash
vLLM is not correctly triggering MLA; following up on this.
Does regular llama.cpp support MLA? Anyway, it is great that Flash supports MLA; support for it will come, hopefully.
> Does regular llama.cpp support MLA?

It supports MLA for DeepSeek and Kimi, but yesterday I couldn't get it working with GLM-4.7-Flash myself on the CUDA backend. I think this is the PR you want to be tracking: https://github.com/ggml-org/llama.cpp/pull/18953
Any smaller model for a 3050 with 4 GB of VRAM?
I used Qwen to generate SRT files from Whisper transcriptions (Chinese audio to English subtitles), but Qwen keeps inserting notes inside the SRT file, making it unusable. I tried Argos, and it's much worse for Chinese-to-English translation. Any tips on how to tackle this? I'm new to this.
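One way to keep notes out of the output, regardless of which model you end up using, is to translate each subtitle cue separately with a strict prompt and reassemble the file, so any extra commentary never lands in the SRT. A minimal sketch, assuming the `srt` and `openai` Python packages and a local OpenAI-compatible server; the endpoint, model name, file names, and prompt below are just placeholders borrowed from the vLLM command earlier in this thread:

```python
# Sketch: translate an SRT file cue-by-cue through a local OpenAI-compatible
# endpoint, keeping only the translated text so stray "notes" never end up
# in the subtitle file. Endpoint, model name, and prompt are placeholders.
import srt
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7777/v1", api_key="12345678")

def translate_cue(text: str) -> str:
    resp = client.chat.completions.create(
        model="glm_47_flash",  # whatever model your server exposes
        messages=[
            {"role": "system",
             "content": "Translate the Chinese subtitle line to English. "
                        "Reply with the translation only, no notes or explanations."},
            {"role": "user", "content": text},
        ],
        temperature=0.2,
    )
    return resp.choices[0].message.content.strip()

with open("input.srt", encoding="utf-8") as f:
    subs = list(srt.parse(f.read()))

for sub in subs:
    sub.content = translate_cue(sub.content)

with open("output.srt", "w", encoding="utf-8") as f:
    f.write(srt.compose(subs))
```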
> Any smaller model for a 3050 with 4 GB of VRAM?
I have made REAP'd versions of the model all the way up to 49% compression, in case you're interested.
Thank you.
However, no, I don't do benchmarks. I only REAP and make sure the output isn't total garbage. That said, I have seen some huge performance drops and weirder output, even when using the ones recommended by Unsloth / z.ai, in the 30, 40, and 50 models.
> I have made REAP'd versions of the model all the way up to 49% compression, in case you're interested.
Yeah, I would love to try it out. What is the size and how can I get it? On Hugging Face? Which setup did you use for this?
Yeah, on my Hugging Face. I used 3x RTX A5000 on RunPod to test out three of them. I used 2x A100 PCIe to REAP the models.
Safetensors:
- https://huggingface.co/Akicou/GLM-4.7-Flash-REAP-39 (~19B)
- https://huggingface.co/Akicou/GLM-4.7-Flash-REAP-50 (~16B)
- https://huggingface.co/Akicou/GLM-4.7-Flash-REAP-19 (~25B)
- https://huggingface.co/Akicou/GLM-4.7-Flash-REAP-09 (~27B)
The parameter estimates are what Hugging Face shows.
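If it helps, here is a minimal sketch of pulling one of these repos locally with `huggingface_hub` and then pointing a server at the downloaded path; the repo choice and target directory are just examples:

```python
# Sketch: download one of the REAP'd safetensors repos so you can point
# vllm serve (or another runtime) at the local path afterwards.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="Akicou/GLM-4.7-Flash-REAP-50",       # pick the compression level you want
    local_dir="models/GLM-4.7-Flash-REAP-50",     # example target directory
)
print(f"Model downloaded to: {local_path}")
# Then, for example:
#   vllm serve models/GLM-4.7-Flash-REAP-50 --served-model-name glm_47_flash_reap
```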
I also made GGUF quants, but llama.cpp needs to fix some problems: regardless of whether it's the REAP or the Unsloth quants, the GGUF model somehow forgets the history, at least for me.
Cerebras also has their own REAP, with benchmarks.