Instructions to use zai-org/GLM-4.7-FP8 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use zai-org/GLM-4.7-FP8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="zai-org/GLM-4.7-FP8")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-FP8")
model = AutoModelForCausalLM.from_pretrained("zai-org/GLM-4.7-FP8")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use zai-org/GLM-4.7-FP8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "zai-org/GLM-4.7-FP8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.7-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
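The same request can be built from Python. This is a minimal sketch using only the standard library, assuming the server started above is listening on localhost:8000; the helper names `build_chat_request` and `send` are hypothetical and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt, model="zai-org/GLM-4.7-FP8",
                       base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request to a running server and return the decoded JSON reply."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request needs no server; sending it does.
req = build_chat_request("What is the capital of France?")
print(req.full_url)
# With the server running:
# print(send(req)["choices"][0]["message"]["content"])
```

Passing `base_url="http://localhost:30000"` should target a server listening on port 30000 instead, provided it exposes the same OpenAI-compatible route.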

Use Docker

docker model run hf.co/zai-org/GLM-4.7-FP8

SGLang

How to use zai-org/GLM-4.7-FP8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "zai-org/GLM-4.7-FP8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.7-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "zai-org/GLM-4.7-FP8" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.7-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use zai-org/GLM-4.7-FP8 with Docker Model Runner:
```
docker model run hf.co/zai-org/GLM-4.7-FP8
```

missing the beginning of think tag

by O-delicious - opened Dec 24, 2025

Discussion

O-delicious

Dec 24, 2025

I served the model with vLLM, without a reasoning_parser, and found that the output starts directly with the reasoning text: the opening `<think>` tag is missing, but the closing `</think>` tag still appears later.

root@iv-ydzbs5zshss6ipm6s5gu /h/n/d/ark_http_proxy# curl --location 'http://localhost/v1/chat/completions' \
    --header 'Authorization: Bearer YOUR_API_KEY' \
    --header 'Content-Type: application/json' \
    --data '{
        "model": "GLM-4.7-FP8",
        "stream": true,
        "messages": [
            {
                "role": "user",
                "content": "what is cryptography"
            }
        ],
        "chat_template_kwargs": {"enable_thinking": true},
        "skip_special_tokens": false,
        "thinking": {
            "type": "enabled"
        },
        "max_tokens": 1024,
        "temperature": 1.0
    }'
data: {"id":"chatcmpl-9fbc092d919f9e51","object":"chat.completion.chunk","created":1766599479,"model":"GLM-4.7-FP8","choices":[{"index":0,"delta":{"role":"assistant","content":"","reasoning_content":null},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}

data: {"id":"chatcmpl-9fbc092d919f9e51","object":"chat.completion.chunk","created":1766599479,"model":"GLM-4.7-FP8","choices":[{"index":0,"delta":{"content":"1","reasoning_content":null},"logprobs":null,"finish_reason":null,"token_ids":null}]}

data: {"id":"chatcmpl-9fbc092d919f9e51","object":"chat.completion.chunk","created":1766599479,"model":"GLM-4.7-FP8","choices":[{"index":0,"delta":{"content":". ","reasoning_content":null},"logprobs":null,"finish_reason":null,"token_ids":null}]}

data: {"id":"chatcmpl-9fbc092d919f9e51","object":"chat.completion.chunk","created":1766599479,"model":"GLM-4.7-FP8","choices":[{"index":0,"delta":{"content":" **An","reasoning_content":null},"logprobs":null,"finish_reason":null,"token_ids":null}]}

data: {"id":"chatcmpl-9fbc092d919f9e51","object":"chat.completion.chunk","created":1766599479,"model":"GLM-4.7-FP8","choices":[{"index":0,"delta":{"content":"alyze the","reasoning_content":null},"logprobs":null,"finish_reason":null,"token_ids":null}]}
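To see what the stream amounts to, the deltas can be concatenated on the client. This is a rough sketch, assuming the usual OpenAI-style framing (one JSON object per `data:` line, with a final `data: [DONE]` sentinel). The function name `collect_stream` is hypothetical, and the sample lines are abbreviated copies of the chunks above with ids and timestamps dropped.

```python
import json


def collect_stream(lines):
    """Concatenate content and reasoning_content deltas from SSE 'data:' lines."""
    content, reasoning = [], []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines
        body = line[len("data:"):].strip()
        if body == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(body)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            # Both fields may be absent or null in a given chunk.
            content.append(delta.get("content") or "")
            reasoning.append(delta.get("reasoning_content") or "")
    return "".join(content), "".join(reasoning)


# Abbreviated copies of the chunks shown above.
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"","reasoning_content":null}}]}',
    '',
    'data: {"choices":[{"index":0,"delta":{"content":"1","reasoning_content":null}}]}',
    '',
    'data: {"choices":[{"index":0,"delta":{"content":". ","reasoning_content":null}}]}',
    '',
    'data: {"choices":[{"index":0,"delta":{"content":" **An","reasoning_content":null}}]}',
    '',
    'data: {"choices":[{"index":0,"delta":{"content":"alyze the","reasoning_content":null}}]}',
]

content, reasoning = collect_stream(sample)
print(repr(content))
print(repr(reasoning))
```

On these chunks the accumulated `content` begins with what looks like reasoning text and `reasoning_content` stays empty, which matches the behavior described in the post.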

I confirmed that the chat template appends `<think>` to the end of the prompt:

root@iv-ydzbs5zshss6ipm6s5gu /h/n/d/ark_http_proxy# curl -sS 'http://127.0.0.1/tokenize' \
                                                      -H 'Content-Type: application/json' \
                                                      -d '{"model":"GLM-4.7-FP8","messages":[{"role":"user","content":"hi"}],"add_generation_prompt":true,"return_token_strs":true}'
{"count":6,"max_model_len":202752,"tokens":[151331,151333,151336,6023,151337,151350],"token_strs":["[gMASK]","<sop>","<|user|>","hi","<|assistant|>","<think>"]}
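One reading of the /tokenize response is that the opening tag is already part of the prompt, so the generated text starts inside the thinking block and only the closing tag is produced. Below is a minimal client-side sketch under that assumption. The helper names `prompt_opens_think` and `split_reasoning` are hypothetical, and the sample completion string is made up for illustration rather than captured from the server.

```python
import json

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def prompt_opens_think(tokenize_response):
    """Return True if the last token of the rendered prompt is the opening think tag."""
    token_strs = tokenize_response.get("token_strs") or []
    return bool(token_strs) and token_strs[-1] == THINK_OPEN


def split_reasoning(text, prompt_opened_think=True):
    """Split generated text into (reasoning, answer).

    Handles output that begins with the opening tag as well as output whose
    opening tag was already consumed by the prompt.
    """
    if text.startswith(THINK_OPEN):
        text = text[len(THINK_OPEN):]
    elif not prompt_opened_think:
        return "", text  # no thinking block at all
    reasoning, sep, answer = text.partition(THINK_CLOSE)
    if not sep:
        # Closing tag never arrived (for example, cut off by max_tokens).
        return reasoning.strip(), ""
    return reasoning.strip(), answer.strip()


# The /tokenize response shown above.
tokenize_response = json.loads(
    '{"count":6,"max_model_len":202752,'
    '"tokens":[151331,151333,151336,6023,151337,151350],'
    '"token_strs":["[gMASK]","<sop>","<|user|>","hi","<|assistant|>","<think>"]}'
)
opened = prompt_opens_think(tokenize_response)

# Hypothetical completion text, for illustration only.
reasoning, answer = split_reasoning(
    "1.  **Analyze the request.**</think>Cryptography is the study of secure communication.",
    prompt_opened_think=opened,
)
print(opened)
print(reasoning)
print(answer)
```

This only works around the symptom on the client. Whether the server should re-emit the opening tag or route the text into `reasoning_content` is the question raised in the rest of this thread.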

O-delicious

Dec 24, 2025 (edited Dec 24, 2025)

I think it is a vLLM bug. I wrote a patch and opened an issue: https://github.com/vllm-project/vllm/issues/31319

I will wait for the vLLM team to confirm and then close this one.

alzee

Jan 24 (edited Jan 24)

> I think it is a vLLM bug. I wrote a patch and opened an issue: https://github.com/vllm-project/vllm/issues/31319

I don’t think this is vLLM-related. I’m using MindIE to serve GLM-4.7 on the Ascend platform, and I have exactly the same issue. I searched for “GLM-4.7 missing think” and was led here.
