Small benchmark on 2 x RTX 5000 Pro Blackwell :)
by AlexanderKHR25 - opened
Nice nice nice:
```
============ Serving Benchmark Result ============
Successful requests:                  20
Output token throughput (tok/s):      152.8
Peak output token throughput (tok/s): 248.0
Time to First Token (mean):           178 ms
Time per Output Token (mean):         55.9 ms
P99 TTFT:                             253 ms
P99 TPOT:                             58 ms
Max concurrent requests:              17
==================================================
```
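If you want to reproduce the mean/P99 summary from your own per-request timings, here is a minimal sketch. The `summarize` helper and the sample numbers are illustrative, not part of the benchmark harness:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile of a list of latency samples."""
    ranked = sorted(values)
    # nearest-rank method: ceil(p/100 * n) gives a 1-based index
    idx = max(1, math.ceil(p / 100 * len(ranked)))
    return ranked[idx - 1]

def summarize(ttft_ms, tpot_ms):
    """Aggregate per-request Time-to-First-Token and Time-per-Output-Token."""
    return {
        "mean_ttft_ms": sum(ttft_ms) / len(ttft_ms),
        "p99_ttft_ms": percentile(ttft_ms, 99),
        "mean_tpot_ms": sum(tpot_ms) / len(tpot_ms),
        "p99_tpot_ms": percentile(tpot_ms, 99),
    }

# Illustrative samples, NOT the raw data behind the numbers above
ttft = [150, 160, 170, 180, 190, 200, 253]
tpot = [54.0, 55.0, 56.0, 57.0, 58.0]
print(summarize(ttft, tpot))
```

Note that mean TPOT only gives the per-request decode rate (1000 / TPOT tok/s); the aggregate throughput figure depends on how many requests the scheduler batches together.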
```shell
CUDA_VISIBLE_DEVICES=0,1 \
HF_HOME=/data/huggingface \
NCCL_P2P_DISABLE=1 \
NCCL_IB_DISABLE=1 \
NCCL_PROTO=Simple \
NCCL_SOCKET_IFNAME=lo \
NCCL_DEBUG=WARN \
python -m vllm.entrypoints.openai.api_server \
  --model Benasd/Qwen2.5-VL-72B-Instruct-AWQ \
  --dtype float16 \
  --quantization awq \
  --pipeline-parallel-size 2 \
  --tensor-parallel-size 1 \
  --max-model-len 6144 \
  --gpu-memory-utilization 0.94 \
  --enforce-eager \
  --host 0.0.0.0 \
  --port 8004
```
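Once the server is up, it exposes the OpenAI-compatible API, so a quick smoke test from the same host might look like this (assuming port 8004 as in the command above):

```shell
curl http://localhost:8004/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Benasd/Qwen2.5-VL-72B-Instruct-AWQ",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32
  }'
```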