Update README.md
README.md
@@ -84,7 +84,7 @@ We follow the standard vLLM performance benchmarking with ShareGPT dataset and o

  | | Time to First Token<br>Median TTFT (ms) ↓ | Time per Output Token<br>Median TPOT (ms) ↓ | Inter-token Latency<br>Median ITL (ms) ↓ |
  | -------------------------------------------- | :-------------------------------------: | :---------------------------------------: | :------------------------------------: |
  | cognitivecomputations/DeepSeek-R1-AWQ | 1585.45 | 55.41 | 43.06 |
- | ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 1344.68 | 41.49 | 36.33 |
+ | ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts<br>**(this model)** | 1344.68 | 41.49 | 36.33 |
  | ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g | 815.19 | 44.65 | 37.88 |

  GPTQ models are faster across all metrics than AWQ models because GPTQ uses fewer bits per parameter than AWQ. More specifically, AWQ has to use a smaller group size of 64 (vs. 128 in GPTQ) to preserve accuracy, as well as zero-points due to asymmetric quantization.
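The bits-per-parameter gap can be sketched with a quick back-of-the-envelope calculation. This is a simplification, assuming fp16 scales stored per group and, for the asymmetric scheme, a 4-bit integer zero-point per group; the exact storage formats vary by kernel and packing:

```python
def bits_per_param(wbits: int, group_size: int,
                   scale_bits: int = 16, zero_bits: int = 0) -> float:
    """Effective storage cost per weight: the quantized value itself
    plus the per-group scale (and zero-point, for asymmetric schemes)
    amortized over the group."""
    return wbits + (scale_bits + zero_bits) / group_size

# Symmetric 4-bit, group size 128 (GPTQ-style): scale only.
gptq = bits_per_param(4, 128)                 # 4.125 bits/param
# Asymmetric 4-bit, group size 64 (AWQ-style): scale + zero-point.
awq = bits_per_param(4, 64, zero_bits=4)      # 4.3125 bits/param
print(gptq, awq)
```

Under these assumptions the asymmetric, smaller-group configuration carries roughly 0.19 extra bits per parameter of quantization metadata, which translates directly into more memory traffic during decoding.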