[Docs] Add LightLLM deployment example
Hi @zai-org team,
We have recently added support for GLM-4.7-Flash in LightLLM.
To give the community more deployment options, we would like to contribute a brief guide and some performance references to the Model Card. We have implemented and verified the model's tool-calling and reasoning capabilities to ensure a robust user experience.
Performance Reference (TP2)
In our local benchmarks (64 concurrent requests, 8k input / 1k output), LightLLM demonstrates efficient serving capabilities:
- Throughput: 18,931 tok/s total token throughput (~31% higher than SGLang in this specific setup).
- Latency: mean TPOT reduced by approximately 24%.
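For reference, both headline percentages follow directly from the figures in the tp2 result dumps below:

```python
# Deriving the summary numbers from the tp2 benchmark results in this post.
lightllm_total_tok_s = 18931.93   # LightLLM tp2 total token throughput
sglang_total_tok_s = 14403.29     # SGLang tp2 total token throughput
speedup = lightllm_total_tok_s / sglang_total_tok_s - 1
print(f"throughput gain: {speedup:.1%}")   # ~31.4%

lightllm_mean_tpot_ms = 26.01     # LightLLM tp2 mean TPOT
sglang_mean_tpot_ms = 34.24       # SGLang tp2 mean TPOT
reduction = 1 - lightllm_mean_tpot_ms / sglang_mean_tpot_ms
print(f"mean TPOT reduction: {reduction:.1%}")  # ~24.0%
```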
Accuracy Reference (BFCL)
On the Berkeley Function Calling Leaderboard:
- LightLLM Overall Accuracy: 49.12%
- SGLang Overall Accuracy: 45.41%
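The overall figures can be sanity-checked against the per-category counts in the full BFCL tables below (passed over total across all 2,911 samples):

```python
# Overall BFCL accuracy = total passed / total samples, per backend.
results = {"LightLLM": (1430, 2911), "SGLang": (1322, 2911)}
for name, (passed, total) in results.items():
    print(f"{name}: {passed / total:.2%}")
```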
Full Results
```shell
# Benchmark script
python -m sglang.bench_serving \
  --backend sglang-oai \
  --model /dev/shm/GLM-4.7-Flash \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 320 \
  --max-concurrency 64 \
  --request-rate inf
```
```shell
# LightLLM tp2 startup script
python -m lightllm.server.api_server \
  --model_dir /dev/shm/GLM-4.7-Flash/ \
  --tp 2 \
  --max_req_total_len 202752 \
  --port 30000
```

```
# Result
============ Serving Benchmark Result ============
Backend: sglang-oai
Traffic request rate: inf
Max request concurrency: 64
Successful requests: 320
Benchmark duration (s): 76.27
Total input tokens: 1273893
Total input text tokens: 1273893
Total generated tokens: 170000
Total generated tokens (retokenized): 169853
Request throughput (req/s): 4.20
Input token throughput (tok/s): 16702.93
Output token throughput (tok/s): 2228.99
Peak output token throughput (tok/s): 3335.00
Peak concurrent requests: 71
Total token throughput (tok/s): 18931.93
Concurrency: 59.12
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 14091.45
Median E2E Latency (ms): 13633.65
P90 E2E Latency (ms): 23682.33
P99 E2E Latency (ms): 27589.25
---------------Time to First Token----------------
Mean TTFT (ms): 652.54
Median TTFT (ms): 177.28
P99 TTFT (ms): 3984.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 26.01
Median TPOT (ms): 26.70
P99 TPOT (ms): 38.65
---------------Inter-Token Latency----------------
Mean ITL (ms): 25.35
Median ITL (ms): 17.52
P95 ITL (ms): 90.26
P99 ITL (ms): 117.75
Max ITL (ms): 3209.75
==================================================
```
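For anyone wanting to query the server started above directly, here is a minimal client sketch. It assumes the OpenAI-compatible `/v1/chat/completions` route that the `sglang-oai` benchmark backend targets, with the model path and port taken from the startup command; adjust if your deployment differs.

```python
import json
from urllib import request

# Build an OpenAI-style chat completion request against the local
# LightLLM server (port 30000, as in the startup script above).
payload = {
    "model": "/dev/shm/GLM-4.7-Flash",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,
}
body = json.dumps(payload).encode()
req = request.Request(
    "http://localhost:30000/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
print(body.decode())
# Once the server is up, send it with:
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```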
```shell
# SGLang tp2 startup script
python -m sglang.launch_server \
  --model /dev/shm/GLM-4.7-Flash \
  --attention-backend flashinfer \
  --tp 2
```

```
# Result
============ Serving Benchmark Result ============
Backend: sglang-oai
Traffic request rate: inf
Max request concurrency: 64
Successful requests: 320
Benchmark duration (s): 100.25
Total input tokens: 1273893
Total input text tokens: 1273893
Total generated tokens: 170000
Total generated tokens (retokenized): 169152
Request throughput (req/s): 3.19
Input token throughput (tok/s): 12707.48
Output token throughput (tok/s): 1695.80
Peak output token throughput (tok/s): 2730.00
Peak concurrent requests: 71
Total token throughput (tok/s): 14403.29
Concurrency: 58.90
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 18451.65
Median E2E Latency (ms): 17985.79
P90 E2E Latency (ms): 30891.18
P99 E2E Latency (ms): 36007.76
---------------Time to First Token----------------
Mean TTFT (ms): 810.07
Median TTFT (ms): 186.54
P99 TTFT (ms): 5677.85
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 34.24
Median TPOT (ms): 34.92
P99 TPOT (ms): 55.08
---------------Inter-Token Latency----------------
Mean ITL (ms): 33.27
Median ITL (ms): 22.77
P95 ITL (ms): 104.16
P99 ITL (ms): 142.71
Max ITL (ms): 5212.68
==================================================
```
```shell
# LightLLM tp1 startup script
python -m lightllm.server.api_server \
  --model_dir /dev/shm/GLM-4.7-Flash/ \
  --tp 1 \
  --max_req_total_len 202752 \
  --port 30000
```

```
# Result
============ Serving Benchmark Result ============
Backend: sglang-oai
Traffic request rate: inf
Max request concurrency: 64
Successful requests: 320
Benchmark duration (s): 106.86
Total input tokens: 1273893
Total input text tokens: 1273893
Total generated tokens: 170000
Total generated tokens (retokenized): 169819
Request throughput (req/s): 2.99
Input token throughput (tok/s): 11921.64
Output token throughput (tok/s): 1590.93
Peak output token throughput (tok/s): 2528.00
Peak concurrent requests: 71
Total token throughput (tok/s): 13512.57
Concurrency: 59.28
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 19796.59
Median E2E Latency (ms): 19288.46
P90 E2E Latency (ms): 33296.09
P99 E2E Latency (ms): 38643.51
---------------Time to First Token----------------
Mean TTFT (ms): 978.99
Median TTFT (ms): 245.46
P99 TTFT (ms): 6411.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 36.52
Median TPOT (ms): 37.32
P99 TPOT (ms): 58.54
---------------Inter-Token Latency----------------
Mean ITL (ms): 35.50
Median ITL (ms): 23.90
P95 ITL (ms): 105.18
P99 ITL (ms): 204.22
Max ITL (ms): 5473.59
==================================================
```
```shell
# SGLang tp1 startup script
python -m sglang.launch_server \
  --model /dev/shm/GLM-4.7-Flash \
  --attention-backend flashinfer \
  --tp 1
```

```
# Result
============ Serving Benchmark Result ============
Backend: sglang-oai
Traffic request rate: inf
Max request concurrency: 64
Successful requests: 320
Benchmark duration (s): 130.16
Total input tokens: 1273893
Total input text tokens: 1273893
Total generated tokens: 170000
Total generated tokens (retokenized): 169201
Request throughput (req/s): 2.46
Input token throughput (tok/s): 9787.45
Output token throughput (tok/s): 1306.13
Peak output token throughput (tok/s): 2304.00
Peak concurrent requests: 70
Total token throughput (tok/s): 11093.57
Concurrency: 59.28
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 24110.17
Median E2E Latency (ms): 23328.84
P90 E2E Latency (ms): 40500.42
P99 E2E Latency (ms): 47501.98
---------------Time to First Token----------------
Mean TTFT (ms): 1168.04
Median TTFT (ms): 275.58
P99 TTFT (ms): 8623.51
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 44.74
Median TPOT (ms): 45.52
P99 TPOT (ms): 78.30
---------------Inter-Token Latency----------------
Mean ITL (ms): 43.26
Median ITL (ms): 27.99
P95 ITL (ms): 134.90
P99 ITL (ms): 217.75
Max ITL (ms): 8390.66
==================================================
```
```
# LightLLM BFCL eval result
============================================================
SUMMARY : LightLLM
============================================================
Category               Total   Passed   Accuracy
------------------------------------------------------------
simple                   400      250     62.50%
multiple                 200      109     54.50%
parallel                 200      139     69.50%
parallel_multiple        200      123     61.50%
java                     100       66     66.00%
javascript                50       24     48.00%
irrelevance              240      200     83.33%
live_simple              258      118     45.74%
live_multiple           1053      358     34.00%
live_parallel             16        4     25.00%
live_parallel_multiple    24        9     37.50%
rest                      70        2      2.86%
sql                      100       28     28.00%
------------------------------------------------------------
OVERALL                 2911     1430     49.12%
============================================================
```
```
# SGLang BFCL eval result
============================================================
SUMMARY : SGLang
============================================================
Category               Total   Passed   Accuracy
------------------------------------------------------------
simple                   400      244     61.00%
multiple                 200      109     54.50%
parallel                 200      144     72.00%
parallel_multiple        200      121     60.50%
java                     100        4      4.00%
javascript                50        1      2.00%
irrelevance              240      200     83.33%
live_simple              258      114     44.19%
live_multiple           1053      347     32.95%
live_parallel             16        2     12.50%
live_parallel_multiple    24        8     33.33%
rest                      70        3      4.29%
sql                      100       25     25.00%
------------------------------------------------------------
OVERALL                 2911     1322     45.41%
============================================================
```
Trying to run on 4xH100 with both the `ghcr.io/modeltc/lightllm:main` and `ghcr.io/modeltc/lightllm:main-deepep` images, but I get this error:

```
WARNING 01-30 14:43:25 [sgl_utils.py:14] sgl_kernel is not installed, you can't use the api of it. You can solve it by running `pip install sgl_kernel`.
WARNING 01-30 14:43:25 [sgl_utils.py:29] sgl_kernel is not installed, or the installed version did not support fa3. Try to upgrade it.
WARNING 01-30 14:43:25 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
INFO 01-30 14:43:26 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
terminate called after throwing an instance of 'std::length_error'
  what(): vector::reserve
```
Thank you for your feedback. We are currently fixing this issue. As a temporary workaround, you can use this image: `docker pull jyily/lightllm:cu129-78cc66a`.