```
lighteval vllm $MODEL_ARGS "custom|aime24|0|0,custom|math_500|0|0,custom|gpqa:di
```

Please use this version of vLLM: https://github.com/vllm-project/vllm/pull/16038
## Performance benchmarking

We follow the standard vLLM performance benchmarking with the ShareGPT dataset and observe the following metrics (lower is better):
| Model | Time to First Token<br>Median TTFT (ms) ↓ | Time per Output Token<br>Median TPOT (ms) ↓ | Inter-token Latency<br>Median ITL (ms) ↓ |
| -------------------------------------------- | :---------------------------------------: | :-----------------------------------------: | :--------------------------------------: |
| cognitivecomputations/DeepSeek-R1-AWQ | 1585.45 | 55.41 | 43.06 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 1344.68 | 41.49 | 36.33 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g | 815.19 | 44.65 | 37.88 |
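For reference, the three metrics can be derived from per-request token timestamps as follows. This is a minimal sketch with made-up timing data, not the actual vLLM benchmark harness (the numbers in the table come from vLLM's standard benchmarking scripts):

```python
import statistics

def request_metrics(send_ts, token_ts):
    """Derive latency metrics for one request from its send time and the
    arrival times of its output tokens (all in seconds)."""
    ttft = token_ts[0] - send_ts                             # time to first token
    itls = [b - a for a, b in zip(token_ts, token_ts[1:])]   # inter-token latencies
    # time per output token: decode time averaged over tokens after the first
    tpot = (token_ts[-1] - token_ts[0]) / (len(token_ts) - 1)
    return ttft, tpot, itls

# Hypothetical timing data for two requests (illustrative only).
requests = [
    (0.00, [1.30, 1.34, 1.39, 1.43]),
    (0.10, [1.75, 1.80, 1.84, 1.90]),
]

ttfts, tpots, itls = [], [], []
for send_ts, token_ts in requests:
    ttft, tpot, req_itls = request_metrics(send_ts, token_ts)
    ttfts.append(ttft)
    tpots.append(tpot)
    itls.extend(req_itls)

print(f"Median TTFT: {statistics.median(ttfts) * 1000:.2f} ms")
print(f"Median TPOT: {statistics.median(tpots) * 1000:.2f} ms")
print(f"Median ITL:  {statistics.median(itls) * 1000:.2f} ms")
```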
The GPTQ models are faster across all metrics than the AWQ model because GPTQ uses fewer bits per parameter. More specifically, to preserve accuracy AWQ has to use a smaller group size of 64 (vs. 128 in GPTQ) and, due to its asymmetric quantization, store zero-points as well.
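The bits-per-parameter gap can be seen with back-of-the-envelope arithmetic: each group of weights amortizes a scale (and, for asymmetric schemes, a zero-point) over its members. This is a simplified accounting that assumes fp16 scales and a 4-bit zero-point, and ignores packing and metadata overheads; the exact layout depends on the kernel:

```python
def effective_bits(weight_bits, group_size, scale_bits=16, zero_point_bits=0):
    """Average storage cost per weight: the quantized weight itself plus the
    per-group scale (and zero-point, if any) amortized over the group."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size

# GPTQ here: 4-bit symmetric, group size 128 -> no zero-point to store.
gptq = effective_bits(4, 128)
# AWQ here: 4-bit asymmetric, group size 64, with an assumed 4-bit zero-point.
awq = effective_bits(4, 64, zero_point_bits=4)

print(f"GPTQ ~{gptq} bits/param, AWQ ~{awq} bits/param")
```

Under these assumptions GPTQ lands at 4 + 16/128 = 4.125 bits per parameter, while AWQ needs 4 + (16 + 4)/64 = 4.3125.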
## Contributors

Denis Kuznedelev (Yandex), Eldar Kurtić (Red Hat AI & ISTA), Jiale Chen (ISTA), Michael Goin (Red Hat AI), Elias Frantar (ISTA), Dan Alistarh (Red Hat AI & ISTA).