TMElyralab/DeepSeek-R1-AWQ-W4AFP8

Text Generation

text-generation-inference

4-bit precision

xiaosa committed on Aug 22, 2025

Commit 25fc680 · verified · 1 Parent(s): 01d2990

Update README.md

Files changed (1)

README.md +40 -3

@@ -4,11 +4,48 @@ base_model:
 - deepseek-ai/DeepSeek-R1
 base_model_relation: quantized
 ---
-# DeepSeek-R1-W4AFP8
-This model is a W4AFP8 quantized DeepSeek-R1 with AWQ quantization.
 Related PR: https://github.com/sgl-project/sglang/pull/8573
-Related Project: https://github.com/TMElyralab/sglang/tree/lyra_w4afp8
 ------
 <!-- markdownlint-disable first-line-h1 -->

 - deepseek-ai/DeepSeek-R1
 base_model_relation: quantized
 ---
+# DeepSeek-V3.1-W4AFP8
+This model is a W4AFP8 quantized DeepSeek-V3.1 with AWQ quantization.
 Related PR: https://github.com/sgl-project/sglang/pull/8573
+Related Project: https://github.com/TMElyralab/sglang/tree/lyra_w4afp8
+## Benchmark
+Test configuration: input/output length = 1000/1000 tokens, qps=64, max_concurrency=64, num_prompt=128
+Device: 8 × H20 GPUs
+Compared to the original model:
+- bs=64: input/output throughput increased by 56%.
+- bs=128: input/output throughput increased by 125%.
+```
+============ Serving Benchmark Result ============
+Backend:                                 sglang
+Max request concurrency:                 64
+Successful requests:                     128
+Benchmark duration (s):                  105.50
+Total input tokens:                      128000
+Total generated tokens:                  128000
+Total generated tokens (retokenized):    127551
+Request throughput (req/s):              1.21
+Input token throughput (tok/s):          1213.24
+Output token throughput (tok/s):         1213.24
+Total token throughput (tok/s):          2426.49
+Concurrency:                             63.97
+----------------End-to-End Latency----------------
+Mean E2E Latency (ms):                   52728.31
+Median E2E Latency (ms):                 52728.33
+---------------Time to First Token----------------
+Mean TTFT (ms):                          5444.26
+Median TTFT (ms):                        5425.69
+P99 TTFT (ms):                           8768.54
+---------------Inter-Token Latency----------------
+Mean ITL (ms):                           47.33
+Median ITL (ms):                         44.18
+P95 ITL (ms):                            46.58
+P99 ITL (ms):                            46.76
+Max ITL (ms):                            7819.3
+==================================================
+```
 ------
 <!-- markdownlint-disable first-line-h1 -->
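The headline figures in the benchmark output can be cross-checked against each other. The sketch below is not part of the commit; it recomputes the throughput lines from the request count and duration, estimates average concurrency with Little's law (throughput times mean latency), and approximates mean end-to-end latency as TTFT plus one inter-token gap per remaining output token. The variable and function names are illustrative, and the assumption that every request produced exactly 1000 output tokens comes from the stated test configuration rather than from per-request logs.

```python
# Figures copied from the "Serving Benchmark Result" block in the diff above.
SUCCESSFUL_REQUESTS = 128
DURATION_S = 105.50
TOTAL_INPUT_TOKENS = 128_000
TOTAL_OUTPUT_TOKENS = 128_000
MEAN_E2E_MS = 52728.31
MEAN_TTFT_MS = 5444.26
MEAN_ITL_MS = 47.33


def derive_metrics() -> dict:
    """Recompute aggregate metrics from the raw counts and mean latencies."""
    request_throughput = SUCCESSFUL_REQUESTS / DURATION_S
    input_throughput = TOTAL_INPUT_TOKENS / DURATION_S
    output_throughput = TOTAL_OUTPUT_TOKENS / DURATION_S

    # Little's law: average requests in flight = arrival rate * mean latency.
    concurrency = request_throughput * (MEAN_E2E_MS / 1000.0)

    # Assumption: each request emits the same number of output tokens, so
    # there are (output_len - 1) inter-token gaps after the first token.
    output_len = TOTAL_OUTPUT_TOKENS // SUCCESSFUL_REQUESTS
    estimated_e2e_ms = MEAN_TTFT_MS + (output_len - 1) * MEAN_ITL_MS

    return {
        "request_throughput": request_throughput,
        "input_throughput": input_throughput,
        "output_throughput": output_throughput,
        "total_throughput": input_throughput + output_throughput,
        "concurrency": concurrency,
        "estimated_e2e_ms": estimated_e2e_ms,
    }


metrics = derive_metrics()
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Small differences from the reported values are expected, because the printed duration and latencies are rounded to two decimal places.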