[Docs] Add LightLLM deployment example

#57
by FubaoSu - opened

Hi @zai-org team,

We have recently added support for GLM-4.7-Flash in LightLLM.

To give the community more deployment options, we would like to contribute a brief guide and some performance references to the Model Card. We have implemented and verified this model's tool-calling and reasoning capabilities to ensure a robust user experience; a minimal tool-calling request is sketched below.
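
The following is only an illustration, assuming a LightLLM server started as in the startup scripts further down (listening on port 30000) and an OpenAI-compatible chat route; the get_weather tool is a made-up placeholder, not part of the model or server.

# Example tool-calling request (hypothetical get_weather tool; route is an assumption)
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "GLM-4.7-Flash",
        "messages": [{"role": "user", "content": "What is the weather in Beijing?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"]
                }
            }
        }]
    }'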

Performance Reference (TP2)

In our local benchmarks (64 concurrent requests, 8k input / 1k output), LightLLM demonstrates efficient serving capabilities:

  • Throughput: Reaches 18,931 total tok/s, ~31% higher than SGLang's 14,403 tok/s in this specific setup.
  • Latency: Reduces Mean TPOT by approximately 24% (26.01 ms vs 34.24 ms).

Accuracy Reference (BFCL)

On the Berkeley Function Calling Leaderboard:

  • LightLLM Overall Accuracy: 49.12% (1430/2911 cases passed)
  • SGLang Overall Accuracy: 45.41% (1322/2911 cases passed)

Full Results

# Benchmark script
python -m sglang.bench_serving \
    --backend sglang-oai \
    --model /dev/shm/GLM-4.7-Flash \
    --dataset-name random \
    --random-input-len 8000 \
    --random-output-len 1000 \
    --num-prompts 320 \
    --max-concurrency 64 \
    --request-rate inf


# LightLLM tp2 startup script
python -m lightllm.server.api_server \
    --model_dir /dev/shm/GLM-4.7-Flash/ \
    --tp 2 \
    --max_req_total_len 202752 \
    --port 30000 
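
Before pointing the benchmark at this server, a quick sanity check can confirm it is serving. This assumes the OpenAI-compatible chat route; adjust the path if your deployment exposes a different one:

# Sanity check once the server reports ready (route is an assumption)
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "GLM-4.7-Flash", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'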

# Result
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     320       
Benchmark duration (s):                  76.27     
Total input tokens:                      1273893   
Total input text tokens:                 1273893   
Total generated tokens:                  170000    
Total generated tokens (retokenized):    169853    
Request throughput (req/s):              4.20      
Input token throughput (tok/s):          16702.93  
Output token throughput (tok/s):         2228.99   
Peak output token throughput (tok/s):    3335.00   
Peak concurrent requests:                71        
Total token throughput (tok/s):          18931.93  
Concurrency:                             59.12     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   14091.45  
Median E2E Latency (ms):                 13633.65  
P90 E2E Latency (ms):                    23682.33  
P99 E2E Latency (ms):                    27589.25  
---------------Time to First Token----------------
Mean TTFT (ms):                          652.54    
Median TTFT (ms):                        177.28    
P99 TTFT (ms):                           3984.77   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          26.01     
Median TPOT (ms):                        26.70     
P99 TPOT (ms):                           38.65     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           25.35     
Median ITL (ms):                         17.52     
P95 ITL (ms):                            90.26     
P99 ITL (ms):                            117.75    
Max ITL (ms):                            3209.75   

# SGLang tp2 startup script
python -m sglang.launch_server \
  --model /dev/shm/GLM-4.7-Flash \
  --attention-backend flashinfer \
  --tp 2
  
# Result
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     320       
Benchmark duration (s):                  100.25    
Total input tokens:                      1273893   
Total input text tokens:                 1273893   
Total generated tokens:                  170000    
Total generated tokens (retokenized):    169152    
Request throughput (req/s):              3.19      
Input token throughput (tok/s):          12707.48  
Output token throughput (tok/s):         1695.80   
Peak output token throughput (tok/s):    2730.00   
Peak concurrent requests:                71        
Total token throughput (tok/s):          14403.29  
Concurrency:                             58.90     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   18451.65  
Median E2E Latency (ms):                 17985.79  
P90 E2E Latency (ms):                    30891.18  
P99 E2E Latency (ms):                    36007.76  
---------------Time to First Token----------------
Mean TTFT (ms):                          810.07    
Median TTFT (ms):                        186.54    
P99 TTFT (ms):                           5677.85   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          34.24     
Median TPOT (ms):                        34.92     
P99 TPOT (ms):                           55.08     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           33.27     
Median ITL (ms):                         22.77     
P95 ITL (ms):                            104.16    
P99 ITL (ms):                            142.71    
Max ITL (ms):                            5212.68   
==================================================

# LightLLM tp1 startup script
python -m lightllm.server.api_server \
    --model_dir /dev/shm/GLM-4.7-Flash/ \
    --tp 1 \
    --max_req_total_len 202752 \
    --port 30000 

# Result
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     320       
Benchmark duration (s):                  106.86    
Total input tokens:                      1273893   
Total input text tokens:                 1273893   
Total generated tokens:                  170000    
Total generated tokens (retokenized):    169819    
Request throughput (req/s):              2.99      
Input token throughput (tok/s):          11921.64  
Output token throughput (tok/s):         1590.93   
Peak output token throughput (tok/s):    2528.00   
Peak concurrent requests:                71        
Total token throughput (tok/s):          13512.57  
Concurrency:                             59.28     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   19796.59  
Median E2E Latency (ms):                 19288.46  
P90 E2E Latency (ms):                    33296.09  
P99 E2E Latency (ms):                    38643.51  
---------------Time to First Token----------------
Mean TTFT (ms):                          978.99    
Median TTFT (ms):                        245.46    
P99 TTFT (ms):                           6411.26   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.52     
Median TPOT (ms):                        37.32     
P99 TPOT (ms):                           58.54     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           35.50     
Median ITL (ms):                         23.90     
P95 ITL (ms):                            105.18    
P99 ITL (ms):                            204.22    
Max ITL (ms):                            5473.59   
==================================================

# SGLang tp1 startup script
python -m sglang.launch_server \
  --model /dev/shm/GLM-4.7-Flash \
  --attention-backend flashinfer \
  --tp 1
  
# Result
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     320       
Benchmark duration (s):                  130.16    
Total input tokens:                      1273893   
Total input text tokens:                 1273893   
Total generated tokens:                  170000    
Total generated tokens (retokenized):    169201    
Request throughput (req/s):              2.46      
Input token throughput (tok/s):          9787.45   
Output token throughput (tok/s):         1306.13   
Peak output token throughput (tok/s):    2304.00   
Peak concurrent requests:                70        
Total token throughput (tok/s):          11093.57  
Concurrency:                             59.28     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   24110.17  
Median E2E Latency (ms):                 23328.84  
P90 E2E Latency (ms):                    40500.42  
P99 E2E Latency (ms):                    47501.98  
---------------Time to First Token----------------
Mean TTFT (ms):                          1168.04   
Median TTFT (ms):                        275.58    
P99 TTFT (ms):                           8623.51   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          44.74     
Median TPOT (ms):                        45.52     
P99 TPOT (ms):                           78.30     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           43.26     
Median ITL (ms):                         27.99     
P95 ITL (ms):                            134.90    
P99 ITL (ms):                            217.75    
Max ITL (ms):                            8390.66   
==================================================

# LightLLM BFCL eval result
============================================================
SUMMARY : LightLLM
============================================================
Category                     Total   Passed   Accuracy
------------------------------------------------------------
simple                         400      250     62.50%
multiple                       200      109     54.50%
parallel                       200      139     69.50%
parallel_multiple              200      123     61.50%
java                           100       66     66.00%
javascript                      50       24     48.00%
irrelevance                    240      200     83.33%
live_simple                    258      118     45.74%
live_multiple                 1053      358     34.00%
live_parallel                   16        4     25.00%
live_parallel_multiple          24        9     37.50%
rest                            70        2      2.86%
sql                            100       28     28.00%
------------------------------------------------------------
OVERALL                       2911     1430     49.12%
============================================================

# SGLang BFCL eval result
============================================================
SUMMARY : SGLang
============================================================
Category                     Total   Passed   Accuracy
------------------------------------------------------------
simple                         400      244     61.00%
multiple                       200      109     54.50%
parallel                       200      144     72.00%
parallel_multiple              200      121     60.50%
java                           100        4      4.00%
javascript                      50        1      2.00%
irrelevance                    240      200     83.33%
live_simple                    258      114     44.19%
live_multiple                 1053      347     32.95%
live_parallel                   16        2     12.50%
live_parallel_multiple          24        8     33.33%
rest                            70        3      4.29%
sql                            100       25     25.00%
------------------------------------------------------------
OVERALL                       2911     1322     45.41%
============================================================
FubaoSu changed pull request title from [Docs] Add LightLLM deployment example (faster than SGLang) to [Docs] Add LightLLM deployment example

Trying to run on 4xH100 with both the ghcr.io/modeltc/lightllm:main and ghcr.io/modeltc/lightllm:main-deepep images, but I get this error:

WARNING 01-30 14:43:25 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it.                    You can solve it by running `pip install sgl_kernel`.
WARNING 01-30 14:43:25 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3.         Try to upgrade it.
WARNING 01-30 14:43:25 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
INFO 01-30 14:43:26 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
terminate called after throwing an instance of 'std::length_error'
  what():  vector::reserve

Thank you for your feedback. We are currently fixing this issue. In the meantime, you can use this image: docker pull jyily/lightllm:cu129-78cc66a
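
For example, a minimal docker run invocation with that image could look like the sketch below; the GPU count, shared-memory size, and mount paths are illustrative assumptions to adapt to your environment:

# Minimal sketch: serving with the temporary image (paths and sizes are assumptions)
docker run --gpus all --shm-size 32g -p 30000:30000 \
    -v /dev/shm/GLM-4.7-Flash:/models/GLM-4.7-Flash \
    jyily/lightllm:cu129-78cc66a \
    python -m lightllm.server.api_server \
        --model_dir /models/GLM-4.7-Flash \
        --tp 4 \
        --max_req_total_len 202752 \
        --port 30000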
