```
lighteval vllm $MODEL_ARGS "custom|aime24|0|0,custom|math_500|0|0,custom|gpqa:di
```

Please use this version of vLLM: https://github.com/vllm-project/vllm/pull/16038
## Performance benchmarking

We follow the standard vLLM performance benchmarking with the ShareGPT dataset and observe the following metrics (lower is better):
| Model | Time to First Token<br>Median TTFT (ms) ↓ | Time per Output Token<br>Median TPOT (ms) ↓ | Inter-token Latency<br>Median ITL (ms) ↓ |
| -------------------------------------------- | :---------------------------------------: | :-----------------------------------------: | :--------------------------------------: |
| cognitivecomputations/DeepSeek-R1-AWQ | 1585.45 | 55.41 | 43.06 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 1344.68 | 41.49 | 36.33 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g | 815.19 | 44.65 | 37.88 |
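For reference, the three metrics can be derived from per-request token timestamps as follows. This is a minimal sketch with made-up timing data, not the actual vLLM benchmark harness (the numbers in the table come from vLLM's standard benchmarking scripts):

```python
import statistics

def request_metrics(send_ts, token_ts):
    """Derive latency metrics for one request from its send time and the
    arrival times of its output tokens (all in seconds)."""
    ttft = token_ts[0] - send_ts                             # time to first token
    itls = [b - a for a, b in zip(token_ts, token_ts[1:])]   # inter-token latencies
    # time per output token: decode time averaged over tokens after the first
    tpot = (token_ts[-1] - token_ts[0]) / (len(token_ts) - 1)
    return ttft, tpot, itls

# Hypothetical timing data for two requests (illustrative only).
requests = [
    (0.00, [1.30, 1.34, 1.39, 1.43]),
    (0.10, [1.75, 1.80, 1.84, 1.90]),
]

ttfts, tpots, itls = [], [], []
for send_ts, token_ts in requests:
    ttft, tpot, req_itls = request_metrics(send_ts, token_ts)
    ttfts.append(ttft)
    tpots.append(tpot)
    itls.extend(req_itls)

print(f"Median TTFT: {statistics.median(ttfts) * 1000:.2f} ms")
print(f"Median TPOT: {statistics.median(tpots) * 1000:.2f} ms")
print(f"Median ITL:  {statistics.median(itls) * 1000:.2f} ms")
```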
The GPTQ models are faster across all metrics than the AWQ model because GPTQ uses fewer bits per parameter. More specifically, to preserve accuracy AWQ has to use a smaller group size of 64 (vs. 128 in GPTQ) and, due to its asymmetric quantization, store zero-points as well.
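The bits-per-parameter gap can be seen with back-of-the-envelope arithmetic: each group of weights amortizes a scale (and, for asymmetric schemes, a zero-point) over its members. This is a simplified accounting that assumes fp16 scales and a 4-bit zero-point, and ignores packing and metadata overheads; the exact layout depends on the kernel:

```python
def effective_bits(weight_bits, group_size, scale_bits=16, zero_point_bits=0):
    """Average storage cost per weight: the quantized weight itself plus the
    per-group scale (and zero-point, if any) amortized over the group."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size

# GPTQ here: 4-bit symmetric, group size 128 -> no zero-point to store.
gptq = effective_bits(4, 128)
# AWQ here: 4-bit asymmetric, group size 64, with an assumed 4-bit zero-point.
awq = effective_bits(4, 64, zero_point_bits=4)

print(f"GPTQ ~{gptq} bits/param, AWQ ~{awq} bits/param")
```

Under these assumptions GPTQ lands at 4 + 16/128 = 4.125 bits per parameter, while AWQ needs 4 + (16 + 4)/64 = 4.3125.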
## Contributors

Denis Kuznedelev (Yandex), Eldar Kurtić (Red Hat AI & ISTA), Jiale Chen (ISTA), Michael Goin (Red Hat AI), Elias Frantar (ISTA), Dan Alistarh (Red Hat AI & ISTA).