ekurtic committed
Commit e81f1b6 · verified · 1 Parent(s): 4b2253b

Update README.md

Files changed (1): README.md +12 -0
README.md CHANGED
@@ -78,5 +78,17 @@ lighteval vllm $MODEL_ARGS "custom|aime24|0|0,custom|math_500|0|0,custom|gpqa:di
```
Please use this version of vLLM: https://github.com/vllm-project/vllm/pull/16038

+ ## Performance benchmarking
+ We follow the standard vLLM performance benchmarking with the ShareGPT dataset and observe the following metrics (lower is better):
+
+ | Model | Time to First Token<br>Median TTFT (ms) ↓ | Time per Output Token<br>Median TPOT (ms) ↓ | Inter-token Latency<br>Median ITL (ms) ↓ |
+ | -------------------------------------------- | :-------------------------------------: | :---------------------------------------: | :------------------------------------: |
+ | cognitivecomputations/DeepSeek-R1-AWQ | 1585.45 | 55.41 | 43.06 |
+ | ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 1344.68 | 41.49 | 36.33 |
+ | ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g | 815.19 | 44.65 | 37.88 |
+
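For reference, the three metrics can be defined from per-request token-arrival timestamps. A minimal sketch of those definitions (the timestamps below are made up for illustration, not taken from the benchmark above):

```python
# Sketch of how the three latency metrics are defined, using made-up
# token-arrival timestamps (in seconds) for a single request.
request_start = 0.0
token_times = [1.3, 1.35, 1.41, 1.46, 1.52]  # arrival time of each output token

# Time to First Token: delay until the first output token arrives.
ttft = token_times[0] - request_start

# Inter-token Latency: gap between consecutive output tokens.
itls = [b - a for a, b in zip(token_times, token_times[1:])]

# Time per Output Token: decode time averaged over all tokens after the first.
tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)

print(f"TTFT: {ttft * 1000:.0f} ms")  # 1300 ms
print(f"TPOT: {tpot * 1000:.1f} ms")  # 55.0 ms
```

The benchmark reports the median of each metric across all requests in the run.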
+ GPTQ models are faster across all metrics than AWQ models because GPTQ uses fewer bits per parameter than AWQ. More specifically, to preserve accuracy AWQ has to use a smaller group size of 64 (vs. 128 in GPTQ) and store zero-points due to its asymmetric quantization.
+
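A rough sanity check on the storage argument: the per-group metadata overhead can be computed directly, assuming fp16 scales and 4-bit zero-points (these packing details are assumptions for illustration; the exact layout in each checkpoint may differ).

```python
# Effective bits per parameter for group-wise 4-bit weight quantization.
# Assumptions (not taken from the checkpoints themselves): each group stores
# an fp16 scale (16 bits), plus a 4-bit zero-point when quantization is
# asymmetric.
def bits_per_param(weight_bits, group_size, scale_bits=16, zero_point_bits=0):
    overhead = (scale_bits + zero_point_bits) / group_size
    return weight_bits + overhead

gptq = bits_per_param(4, group_size=128)                    # symmetric, g=128
awq = bits_per_param(4, group_size=64, zero_point_bits=4)   # asymmetric, g=64

print(f"GPTQ: {gptq:.3f} bits/param")  # 4.125
print(f"AWQ:  {awq:.3f} bits/param")   # 4.312
```

Under these assumptions, the smaller group size and the extra zero-points roughly double AWQ's metadata overhead per weight, consistent with the latency gap above.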
## Contributors
Denis Kuznedelev (Yandex), Eldar Kurtić (Red Hat AI & ISTA), Jiale Chen (ISTA), Michael Goin (Red Hat AI), Elias Frantar (ISTA), Dan Alistarh (Red Hat AI & ISTA).