Update README.md
README.md
@@ -84,7 +84,7 @@ We follow the standard vLLM performance benchmarking with ShareGPT dataset and o

  | | Time to First Token<br>Median TTFT (ms) ↓ | Time per Output Token<br>Median TPOT (ms) ↓ | Inter-token Latency<br>Median ITL (ms) ↓ |
  | -------------------------------------------- | :-------------------------------------: | :---------------------------------------: | :------------------------------------: |
  | cognitivecomputations/DeepSeek-R1-AWQ | 1585.45 | 55.41 | 43.06 |
- | ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 1344.68 | 41.49 | 36.33 |
+ | ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts<br>**(this model)** | 1344.68 | 41.49 | 36.33 |
  | ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g | 815.19 | 44.65 | 37.88 |

  GPTQ models are faster across all metrics than AWQ models because GPTQ uses fewer bits per parameter than AWQ. More specifically, AWQ has to use a smaller group size of 64 (vs. 128 in GPTQ) to preserve accuracy, as well as zero-points due to asymmetric quantization.
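The bits-per-parameter gap can be sketched with a quick back-of-the-envelope calculation. This is a simplification, assuming fp16 scales stored per group and, for the asymmetric scheme, a 4-bit integer zero-point per group; the exact storage formats vary by kernel and packing:

```python
def bits_per_param(wbits: int, group_size: int,
                   scale_bits: int = 16, zero_bits: int = 0) -> float:
    """Effective storage cost per weight: the quantized value itself
    plus the per-group scale (and zero-point, for asymmetric schemes)
    amortized over the group."""
    return wbits + (scale_bits + zero_bits) / group_size

# Symmetric 4-bit, group size 128 (GPTQ-style): scale only.
gptq = bits_per_param(4, 128)                 # 4.125 bits/param
# Asymmetric 4-bit, group size 64 (AWQ-style): scale + zero-point.
awq = bits_per_param(4, 64, zero_bits=4)      # 4.3125 bits/param
print(gptq, awq)
```

Under these assumptions the asymmetric, smaller-group configuration carries roughly 0.19 extra bits per parameter of quantization metadata, which translates directly into more memory traffic during decoding.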