Update README about vllm reproduction scripts
#3
by XuebinWang - opened
README.md
CHANGED
@@ -66,11 +66,34 @@ python3 quantize_quark.py --model_dir /amd/DeepSeek-R1-0528-BF16 \
 </td>
 <td>94.24
 </td>
-<td>
+<td>94.90
 </td>
 </tr>
 </table>
 
+### Reproduction
+
+Docker image: rocm/vllm-dev:base_main_20260212
+
+Step 1: start a vLLM server with the quantized DeepSeek-R1 checkpoint
+
+```bash
+vllm serve amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 \
+--tensor-parallel-size 8 \
+--dtype auto \
+--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
+--gpu-memory-utilization 0.9 \
+--block-size 1 \
+--trust-remote-code \
+--port 8000
+```
+Note: CLI parameters such as `--tensor-parallel-size`, `--gpu-memory-utilization`, and `--port` can be adjusted as needed to match the target runtime environment.
+
+Step 2: in a second terminal, run the GSM8K evaluation client against the running server.
+
+```bash
+python3 tests/evals/gsm8k/gsm8k_eval.py
+```
 
 # License
 Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.
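The added Reproduction section runs the evaluation client in a second terminal, so the server from Step 1 must finish loading the checkpoint before Step 2 can connect. The following is a minimal sketch of a readiness check, not part of this PR. It assumes the server listens on `localhost:8000` as in Step 1 and answers on a `/health` route with HTTP 200 once it is up. The names `http_probe` and `wait_for_server` and the timeout values are hypothetical and chosen only for illustration.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str) -> bool:
    """Return True if a GET on `url` answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_for_server(url, probe=http_probe, timeout_s=1800.0,
                    interval_s=10.0, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `timeout_s` seconds have passed.

    Returns True when the server answered, False on timeout.
    The default timeout and interval are illustrative assumptions.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Example call (assumes the port used in Step 1):
#   wait_for_server("http://localhost:8000/health")
```

Calling `wait_for_server` before launching the Step 2 command would keep the client from failing on a connection error while the model is still loading. If `--port` is changed in Step 1, the URL passed here needs to change with it.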