Thanks again! Here is my review of an 8× RTX 5090 setup.

#2
by crystech - opened

On my RTX 5090 setup I observed slower TPS with MTP enabled:
--speculative-config.method mtp
--speculative-config.num_speculative_tokens 1
Avg ~25 TPS
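For reference, recent vLLM versions also accept the speculative config as a single JSON blob instead of dotted flags. A rough sketch of an equivalent launch; the model name, tensor-parallel size, and exact flag spelling here are assumptions, so check them against the docs for your installed vLLM version:

```shell
# Sketch only: <your-model> is a placeholder, and --tensor-parallel-size 8
# assumes one rank per GPU; verify flag names against your vLLM version.
vllm serve <your-model> \
  --tensor-parallel-size 8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```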

With the limited VRAM of 256 GB (8 × 32 GB), I think RTX 5090 users should turn MTP off to free up memory for KV cache and get better performance?

Once I disabled MTP I got 35-41 TPS (Ubuntu with GDM running; I'd expect slightly higher TPS on a headless setup).

---- Without MTP ----
(APIServer pid=21036) INFO: 127.0.0.1:50526 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=21036) INFO 12-25 02:01:12 [loggers.py:257] Engine 000: Avg prompt throughput: 1383.7 tokens/s, Avg generation throughput: 13.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.1%, Prefix cache hit rate: 0.0%
(APIServer pid=21036) INFO 12-25 02:01:22 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 40.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.3%, Prefix cache hit rate: 0.0%
(APIServer pid=21036) INFO 12-25 02:01:32 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 41.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.6%, Prefix cache hit rate: 0.0%

---- With MTP ----
(APIServer pid=19990) INFO: Application startup complete.
(APIServer pid=19990) INFO: 127.0.0.1:35818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=19990) INFO 12-25 01:51:58 [loggers.py:257] Engine 000: Avg prompt throughput: 1260.1 tokens/s, Avg generation throughput: 13.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.3%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:51:58 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.89, Accepted throughput: 0.77 tokens/s, Drafted throughput: 0.86 tokens/s, Accepted: 62 tokens, Drafted: 70 tokens, Per-position acceptance rate: 0.886, Avg Draft acceptance rate: 88.6%
(APIServer pid=19990) INFO 12-25 01:52:08 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.6%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:08 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.78, Accepted throughput: 11.90 tokens/s, Drafted throughput: 15.30 tokens/s, Accepted: 119 tokens, Drafted: 153 tokens, Per-position acceptance rate: 0.778, Avg Draft acceptance rate: 77.8%
(APIServer pid=19990) INFO 12-25 01:52:18 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.8%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:18 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.78, Accepted throughput: 11.90 tokens/s, Drafted throughput: 15.20 tokens/s, Accepted: 119 tokens, Drafted: 152 tokens, Per-position acceptance rate: 0.783, Avg Draft acceptance rate: 78.3%
(APIServer pid=19990) INFO 12-25 01:52:28 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 28.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.0%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:28 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.79, Accepted throughput: 12.60 tokens/s, Drafted throughput: 15.90 tokens/s, Accepted: 126 tokens, Drafted: 159 tokens, Per-position acceptance rate: 0.792, Avg Draft acceptance rate: 79.2%
(APIServer pid=19990) INFO 12-25 01:52:38 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:38 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.00, Accepted throughput: 0.30 tokens/s, Drafted throughput: 0.30 tokens/s, Accepted: 3 tokens, Drafted: 3 tokens, Per-position acceptance rate: 1.000, Avg Draft acceptance rate: 100.0%
(APIServer pid=19990) INFO 12-25 01:52:48 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
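A quick back-of-the-envelope check on these logs: with a mean acceptance length of ~1.78 tokens per verification step, MTP only breaks even if a combined draft + verify step costs no more than ~1.78× a plain decode step. Plugging in the observed ~27 TPS (MTP) vs ~40 TPS (baseline) suggests the combined step here costs roughly 2.6× a plain step, which would explain the slowdown. The simple one-step cost model below is my own assumption, not something vLLM reports:

```python
# Back-of-the-envelope cost model for speculative decoding.
# Assumption: the baseline emits 1 token per decode step, while MTP
# emits `acceptance_length` tokens per (draft + verify) step.
def implied_step_cost_ratio(tps_mtp, tps_base, acceptance_length):
    """How much more expensive an MTP step must be than a plain
    decode step to produce the observed throughputs.

    tps_mtp  = acceptance_length / t_mtp
    tps_base = 1 / t_base
    =>  t_mtp / t_base = acceptance_length * tps_base / tps_mtp
    """
    return acceptance_length * tps_base / tps_mtp

# Numbers taken from the logs above.
ratio = implied_step_cost_ratio(tps_mtp=27.0, tps_base=40.0,
                                acceptance_length=1.78)
print(f"implied MTP step cost: {ratio:.2f}x a plain decode step")
# -> about 2.64x, i.e. well above the 1.78x break-even point
```

By this reading, either a cheaper draft path or a higher acceptance length would be needed before MTP pays off on this hardware.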

Am I doing anything wrong? Asking just in case.
