Why so slow?

#7
by jasonZhang1 - opened

I am running glm-4.7-flash and GLM-4.7-Flash-FP8-Dynamic on 4× A100 40G GPUs, using the vLLM Docker container.

This is glm-4.7-flash:

```
(APIServer pid=14926) INFO 02-04 01:23:24 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 45.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 86.3%
(APIServer pid=14926) INFO 02-04 01:23:24 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.49, Accepted throughput: 14.80 tokens/s, Drafted throughput: 30.30 tokens/s, Accepted: 148 tokens, Drafted: 303 tokens, Per-position acceptance rate: 0.488, Avg Draft acceptance rate: 48.8%
(APIServer pid=14926) INFO 02-04 01:23:34 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 46.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 86.3%
(APIServer pid=14926) INFO 02-04 01:23:34 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.50, Accepted throughput: 15.40 tokens/s, Drafted throughput: 30.80 tokens/s, Accepted: 154 tokens, Drafted: 308 tokens, Per-position acceptance rate: 0.500, Avg Draft acceptance rate: 50.0%
```

This is GLM-4.7-Flash-FP8-Dynamic:

```
(APIServer pid=6364) INFO 02-04 01:23:35 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 29.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
(APIServer pid=6364) INFO 02-04 01:23:35 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 29.10 tokens/s, Accepted: 0 tokens, Drafted: 291 tokens, Per-position acceptance rate: 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=6364) INFO 02-04 01:23:45 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 29.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
(APIServer pid=6364) INFO 02-04 01:23:45 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 29.80 tokens/s, Accepted: 0 tokens, Drafted: 298 tokens, Per-position acceptance rate: 0.000, Avg Draft acceptance rate: 0.0%
(APIServer pid=6364) INFO 02-04 01:23:55 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 30.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
(APIServer pid=6364) INFO 02-04 01:23:55 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 30.00 tokens/s, Accepted: 0 tokens, Drafted: 300 tokens, Per-position acceptance rate: 0.000, Avg Draft acceptance rate: 0.0%
```
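For reference, the SpecDecoding numbers in the two logs fit together as follows. This is a minimal sketch, assuming one speculative draft token per engine step (which is consistent with the logged mean acceptance lengths, but the actual vLLM configuration may differ):

```python
def spec_decode_stats(accepted: int, drafted: int) -> tuple[float, float]:
    """Derive (draft acceptance rate, mean acceptance length) from token counts."""
    rate = accepted / drafted if drafted else 0.0
    # With one draft token per step, each step always emits one verified
    # target token, plus the draft token whenever it is accepted, so the
    # mean acceptance length is 1 + acceptance rate.
    mean_acceptance_length = 1.0 + rate
    return rate, mean_acceptance_length

# BF16 run (from the first log): Accepted: 154, Drafted: 308
rate, length = spec_decode_stats(154, 308)
print(f"BF16: acceptance rate {rate:.1%}, mean acceptance length {length:.2f}")
# → BF16: acceptance rate 50.0%, mean acceptance length 1.50

# FP8 run (from the second log): Accepted: 0, Drafted: 300
rate, length = spec_decode_stats(0, 300)
print(f"FP8:  acceptance rate {rate:.1%}, mean acceptance length {length:.2f}")
# → FP8:  acceptance rate 0.0%, mean acceptance length 1.00
```

In other words, the BF16 run gets roughly 1.5 tokens out of every verification step, while the FP8 run rejects every drafted token and pays the drafting cost for nothing.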
Why is the FP8-Dynamic model slower? It generates ~30 tokens/s versus ~45 tokens/s for the unquantized model, and its draft acceptance rate is 0%.
