xiaosa committed on
Commit 25fc680 (verified) · 1 Parent(s): 01d2990

Update README.md

Files changed (1)
  1. README.md +40 -3
README.md CHANGED
@@ -4,11 +4,48 @@ base_model:
  - deepseek-ai/DeepSeek-R1
  base_model_relation: quantized
  ---
- # DeepSeek-R1-W4AFP8
 
- This model is a W4AFP8-quantized DeepSeek-R1, produced with AWQ quantization.
  Related PR: https://github.com/sgl-project/sglang/pull/8573
- Related Project: https://github.com/TMElyralab/sglang/tree/lyra_w4afp8
 
  ------
  <!-- markdownlint-disable first-line-h1 -->
 
  - deepseek-ai/DeepSeek-R1
  base_model_relation: quantized
  ---
+ # DeepSeek-V3.1-W4AFP8
 
+ This model is a W4AFP8-quantized DeepSeek-V3.1, produced with AWQ quantization.
  Related PR: https://github.com/sgl-project/sglang/pull/8573
+ Related Project: https://github.com/TMElyralab/sglang/tree/lyra_w4afp8
+
+ ## Benchmark
+ Test configuration: input/output length = 1000/1000 tokens, qps = 64, max_concurrency = 64, num_prompt = 128
+ Device: 8 × H20
+ Compared to the original model:
+ - bs=64: input and output throughput increase by 56%.
+ - bs=128: input and output throughput increase by 125%.
+
+ ```
+ ============ Serving Benchmark Result ============
+ Backend: sglang
+ Max request concurrency: 64
+ Successful requests: 128
+ Benchmark duration (s): 105.50
+ Total input tokens: 128000
+ Total generated tokens: 128000
+ Total generated tokens (retokenized): 127551
+ Request throughput (req/s): 1.21
+ Input token throughput (tok/s): 1213.24
+ Output token throughput (tok/s): 1213.24
+ Total token throughput (tok/s): 2426.49
+ Concurrency: 63.97
+ ----------------End-to-End Latency----------------
+ Mean E2E Latency (ms): 52728.31
+ Median E2E Latency (ms): 52728.33
+ ---------------Time to First Token----------------
+ Mean TTFT (ms): 5444.26
+ Median TTFT (ms): 5425.69
+ P99 TTFT (ms): 8768.54
+ ---------------Inter-Token Latency----------------
+ Mean ITL (ms): 47.33
+ Median ITL (ms): 44.18
+ P95 ITL (ms): 46.58
+ P99 ITL (ms): 46.76
+ Max ITL (ms): 7819.3
+ ==================================================
+ ```
 
  ------
  <!-- markdownlint-disable first-line-h1 -->
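
The derived figures in the benchmark log added by this commit can be sanity-checked from its raw counts. The following is a minimal sketch and is not part of the README or of sglang: the variable names and the `speedup_factor` helper are illustrative, and it assumes every request generated exactly 1000 tokens, as the test configuration states. It recomputes the throughput lines, checks the reported concurrency with Little's law (in-flight requests ≈ request rate × mean latency), approximates mean end-to-end latency as TTFT plus one inter-token gap per remaining output token, and converts the stated percentage gains into multipliers.

```python
# Raw values copied from the "Serving Benchmark Result" log in the diff.
duration_s = 105.50
num_requests = 128
total_input_tokens = 128_000
total_output_tokens = 128_000
mean_e2e_ms = 52728.31
mean_ttft_ms = 5444.26
mean_itl_ms = 47.33
output_len = 1000  # assumption: every request produced exactly 1000 tokens

# Throughput is simply count divided by wall-clock duration.
request_tput = num_requests / duration_s
input_tput = total_input_tokens / duration_s
output_tput = total_output_tokens / duration_s
total_tput = input_tput + output_tput

# Little's law: average in-flight requests = request rate * mean latency.
concurrency = request_tput * (mean_e2e_ms / 1000.0)

# Rough latency model: first token, then one ITL gap per remaining token.
approx_e2e_ms = mean_ttft_ms + mean_itl_ms * (output_len - 1)


def speedup_factor(pct_increase: float) -> float:
    """Convert an 'increased by N%' claim into a multiplier over the baseline."""
    return 1.0 + pct_increase / 100.0


if __name__ == "__main__":
    print(f"request throughput: {request_tput:.2f} req/s")
    print(f"input throughput:   {input_tput:.2f} tok/s")
    print(f"total throughput:   {total_tput:.2f} tok/s")
    print(f"concurrency:        {concurrency:.2f}")
    print(f"approx. mean E2E:   {approx_e2e_ms:.2f} ms")
    print(f"bs=64 speedup:      {speedup_factor(56):.2f}x")
    print(f"bs=128 speedup:     {speedup_factor(125):.2f}x")
```

The recomputed values agree with the log to within the rounding of the reported duration, which suggests the log is internally consistent. The check says nothing about the baseline run, whose numbers are not included in the README.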