QuantTrio
/

DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Medium

Text Generation

DeepSeek-R1-0528

text-generation-inference

4-bit precision

Model card Files Files and versions

JunHowie commited on Jun 20, 2025

Commit

b87222b

·

verified ·

1 Parent(s): d8e48ad

Update README.md

Files changed (1) hide show

README.md +46 -1

README.md CHANGED Viewed

@@ -34,8 +34,53 @@ Variant Overview
 Choose the variant that best matches your hardware and quality requirements.
 ### 【Model Update Date】
-```
 2025-06-04
 1. fast commit
 ```

 Choose the variant that best matches your hardware and quality requirements.
+### 【Vllm 单机（8x141GB）启动命令】
+```
+MAX_REQUESTS=512
+CONTEXT_LEN=163840
+python3 -m vllm.entrypoints.openai.api_server \
+  --model .../QuantTrio/DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Medium \
+  --served-model-name QuantTrio/DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Medium \
+  --swap-space 16 \
+  --tensor-parallel-size 8 \
+  --gpu-memory-utilization 0.95 \
+  --max-num-seqs $MAX_REQUESTS \
+  --max-seq-len-to-capture $CONTEXT_LEN \
+  --max-model-len $CONTEXT_LEN \
+  --enable-auto-tool-choice \
+  --tool-call-parser deepseek_v3 \
+  --chat-template tool_chat_template_deepseekr1.jinja \
+  --disable-log-requests \
+  --host 0.0.0.0 \
+  --port 8000
+```
+### 【H200 throughput performance】
+1. `8 × H200 (141 GB)`、 `context = 163840 tokens`
+| concurrent reqs | total tok/s | tok/s per req |
+|-----------------|-------------|---------------|
+| 1               | 60          | 60.0          |
+| 50              | 1350        | 27.0          |
+| 100             | 2200        | 22.0          |
+| 200             | 3400        | 17.0          |
+| 400             | 5100        | 12.7          |
+2. `4 × H200 (141 GB)`、 `context = 63840 tokens`
+| concurrent reqs | total tok/s | tok/s per req |
+|-----------------|-------------|---------------|
+| 1               | 56          | 56.0          |
+| 50              | 1100        | 22.0          |
+| 100             | 1700        | 17.0          |
+| 200             | 2600        | 13.0          |
+| 400             | 3900        | 9.7           |
 ### 【Model Update Date】
+```
+2025-06-20
+Added vLLM launch example (single node with 8 × H200 / 141 GB) and corresponding concurrency throughput benchmark data.
 2025-06-04
 1. fast commit
 ```