[Docs] Add LightLLM deployment example

#57
by FubaoSu - opened

Hi @zai-org team,

We have recently added support for GLM-4.7-Flash in LightLLM.

To give the community more deployment options, we would like to contribute a brief guide and some performance references to the Model Card. We have implemented and verified this model's tool-calling and reasoning capabilities to ensure a robust user experience; a minimal tool-calling request is sketched below.
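
The following is only an illustration, assuming a LightLLM server started as in the startup scripts further down (listening on port 30000) and an OpenAI-compatible chat route; the get_weather tool is a made-up placeholder, not part of the model or server.

# Example tool-calling request (hypothetical get_weather tool; route is an assumption)
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "GLM-4.7-Flash",
        "messages": [{"role": "user", "content": "What is the weather in Beijing?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"]
                }
            }
        }]
    }'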

Performance Reference (TP2)

In our local benchmarks (64 concurrent requests, 8k input / 1k output), LightLLM demonstrates efficient serving capabilities:

  • Throughput: Reaches 18,931 total tok/s, ~31% higher than SGLang's 14,403 tok/s in this specific setup.
  • Latency: Reduces Mean TPOT by approximately 24% (26.01 ms vs 34.24 ms).

Accuracy Reference (BFCL)

On the Berkeley Function Calling Leaderboard:

  • LightLLM Overall Accuracy: 49.12% (1430/2911 cases passed)
  • SGLang Overall Accuracy: 45.41% (1322/2911 cases passed)

Full Results

# Benchmark script
python -m sglang.bench_serving \
    --backend sglang-oai \
    --model /dev/shm/GLM-4.7-Flash \
    --dataset-name random \
    --random-input-len 8000 \
    --random-output-len 1000 \
    --num-prompts 320 \
    --max-concurrency 64 \
    --request-rate inf


# LightLLM tp2 startup script
python -m lightllm.server.api_server \
    --model_dir /dev/shm/GLM-4.7-Flash/ \
    --tp 2 \
    --max_req_total_len 202752 \
    --port 30000 
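
Before pointing the benchmark at this server, a quick sanity check can confirm it is serving. This assumes the OpenAI-compatible chat route; adjust the path if your deployment exposes a different one:

# Sanity check once the server reports ready (route is an assumption)
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "GLM-4.7-Flash", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'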

# Result
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     320       
Benchmark duration (s):                  76.27     
Total input tokens:                      1273893   
Total input text tokens:                 1273893   
Total generated tokens:                  170000    
Total generated tokens (retokenized):    169853    
Request throughput (req/s):              4.20      
Input token throughput (tok/s):          16702.93  
Output token throughput (tok/s):         2228.99   
Peak output token throughput (tok/s):    3335.00   
Peak concurrent requests:                71        
Total token throughput (tok/s):          18931.93  
Concurrency:                             59.12     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   14091.45  
Median E2E Latency (ms):                 13633.65  
P90 E2E Latency (ms):                    23682.33  
P99 E2E Latency (ms):                    27589.25  
---------------Time to First Token----------------
Mean TTFT (ms):                          652.54    
Median TTFT (ms):                        177.28    
P99 TTFT (ms):                           3984.77   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          26.01     
Median TPOT (ms):                        26.70     
P99 TPOT (ms):                           38.65     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           25.35     
Median ITL (ms):                         17.52     
P95 ITL (ms):                            90.26     
P99 ITL (ms):                            117.75    
Max ITL (ms):                            3209.75   

# SGLang tp2 startup script
python -m sglang.launch_server \
  --model /dev/shm/GLM-4.7-Flash \
  --attention-backend flashinfer \
  --tp 2
  
# Result
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     320       
Benchmark duration (s):                  100.25    
Total input tokens:                      1273893   
Total input text tokens:                 1273893   
Total generated tokens:                  170000    
Total generated tokens (retokenized):    169152    
Request throughput (req/s):              3.19      
Input token throughput (tok/s):          12707.48  
Output token throughput (tok/s):         1695.80   
Peak output token throughput (tok/s):    2730.00   
Peak concurrent requests:                71        
Total token throughput (tok/s):          14403.29  
Concurrency:                             58.90     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   18451.65  
Median E2E Latency (ms):                 17985.79  
P90 E2E Latency (ms):                    30891.18  
P99 E2E Latency (ms):                    36007.76  
---------------Time to First Token----------------
Mean TTFT (ms):                          810.07    
Median TTFT (ms):                        186.54    
P99 TTFT (ms):                           5677.85   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          34.24     
Median TPOT (ms):                        34.92     
P99 TPOT (ms):                           55.08     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           33.27     
Median ITL (ms):                         22.77     
P95 ITL (ms):                            104.16    
P99 ITL (ms):                            142.71    
Max ITL (ms):                            5212.68   
==================================================

# LightLLM tp1 startup script
python -m lightllm.server.api_server \
    --model_dir /dev/shm/GLM-4.7-Flash/ \
    --tp 1 \
    --max_req_total_len 202752 \
    --port 30000 

# Result
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     320       
Benchmark duration (s):                  106.86    
Total input tokens:                      1273893   
Total input text tokens:                 1273893   
Total generated tokens:                  170000    
Total generated tokens (retokenized):    169819    
Request throughput (req/s):              2.99      
Input token throughput (tok/s):          11921.64  
Output token throughput (tok/s):         1590.93   
Peak output token throughput (tok/s):    2528.00   
Peak concurrent requests:                71        
Total token throughput (tok/s):          13512.57  
Concurrency:                             59.28     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   19796.59  
Median E2E Latency (ms):                 19288.46  
P90 E2E Latency (ms):                    33296.09  
P99 E2E Latency (ms):                    38643.51  
---------------Time to First Token----------------
Mean TTFT (ms):                          978.99    
Median TTFT (ms):                        245.46    
P99 TTFT (ms):                           6411.26   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.52     
Median TPOT (ms):                        37.32     
P99 TPOT (ms):                           58.54     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           35.50     
Median ITL (ms):                         23.90     
P95 ITL (ms):                            105.18    
P99 ITL (ms):                            204.22    
Max ITL (ms):                            5473.59   
==================================================

# SGLang tp1 startup script
python -m sglang.launch_server \
  --model /dev/shm/GLM-4.7-Flash \
  --attention-backend flashinfer \
  --tp 1
  
# Result
============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf       
Max request concurrency:                 64        
Successful requests:                     320       
Benchmark duration (s):                  130.16    
Total input tokens:                      1273893   
Total input text tokens:                 1273893   
Total generated tokens:                  170000    
Total generated tokens (retokenized):    169201    
Request throughput (req/s):              2.46      
Input token throughput (tok/s):          9787.45   
Output token throughput (tok/s):         1306.13   
Peak output token throughput (tok/s):    2304.00   
Peak concurrent requests:                70        
Total token throughput (tok/s):          11093.57  
Concurrency:                             59.28     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   24110.17  
Median E2E Latency (ms):                 23328.84  
P90 E2E Latency (ms):                    40500.42  
P99 E2E Latency (ms):                    47501.98  
---------------Time to First Token----------------
Mean TTFT (ms):                          1168.04   
Median TTFT (ms):                        275.58    
P99 TTFT (ms):                           8623.51   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          44.74     
Median TPOT (ms):                        45.52     
P99 TPOT (ms):                           78.30     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           43.26     
Median ITL (ms):                         27.99     
P95 ITL (ms):                            134.90    
P99 ITL (ms):                            217.75    
Max ITL (ms):                            8390.66   
==================================================

# LightLLM BFCL eval result
============================================================
SUMMARY : LightLLM
============================================================
Category                     Total   Passed   Accuracy
------------------------------------------------------------
simple                         400      250     62.50%
multiple                       200      109     54.50%
parallel                       200      139     69.50%
parallel_multiple              200      123     61.50%
java                           100       66     66.00%
javascript                      50       24     48.00%
irrelevance                    240      200     83.33%
live_simple                    258      118     45.74%
live_multiple                 1053      358     34.00%
live_parallel                   16        4     25.00%
live_parallel_multiple          24        9     37.50%
rest                            70        2      2.86%
sql                            100       28     28.00%
------------------------------------------------------------
OVERALL                       2911     1430     49.12%
============================================================

# SGLang BFCL eval result
============================================================
SUMMARY : SGLang
============================================================
Category                     Total   Passed   Accuracy
------------------------------------------------------------
simple                         400      244     61.00%
multiple                       200      109     54.50%
parallel                       200      144     72.00%
parallel_multiple              200      121     60.50%
java                           100        4      4.00%
javascript                      50        1      2.00%
irrelevance                    240      200     83.33%
live_simple                    258      114     44.19%
live_multiple                 1053      347     32.95%
live_parallel                   16        2     12.50%
live_parallel_multiple          24        8     33.33%
rest                            70        3      4.29%
sql                            100       25     25.00%
------------------------------------------------------------
OVERALL                       2911     1322     45.41%
============================================================
FubaoSu changed pull request title from [Docs] Add LightLLM deployment example (faster than SGLang) to [Docs] Add LightLLM deployment example

Trying to run on 4xH100 with both the ghcr.io/modeltc/lightllm:main and ghcr.io/modeltc/lightllm:main-deepep images, but I get this error:

WARNING 01-30 14:43:25 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it.                    You can solve it by running `pip install sgl_kernel`.
WARNING 01-30 14:43:25 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3.         Try to upgrade it.
WARNING 01-30 14:43:25 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
INFO 01-30 14:43:26 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
terminate called after throwing an instance of 'std::length_error'
  what():  vector::reserve

Thank you for your feedback. We are currently fixing this issue. In the meantime, you can use this image: docker pull jyily/lightllm:cu129-78cc66a
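
For example, a minimal docker run invocation with that image could look like the sketch below; the GPU count, shared-memory size, and mount paths are illustrative assumptions to adapt to your environment:

# Minimal sketch: serving with the temporary image (paths and sizes are assumptions)
docker run --gpus all --shm-size 32g -p 30000:30000 \
    -v /dev/shm/GLM-4.7-Flash:/models/GLM-4.7-Flash \
    jyily/lightllm:cu129-78cc66a \
    python -m lightllm.server.api_server \
        --model_dir /models/GLM-4.7-Flash \
        --tp 4 \
        --max_req_total_len 202752 \
        --port 30000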
