---
license: mit
base_model:
- deepseek-ai/DeepSeek-R1-0528
---


# Model Overview

- **Model Architecture:** DeepSeek-R1-0528
  - **Input:** Text
  - **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.0
- **PyTorch:** 2.8.0
- **Transformers:** 4.53.0
- **Operating System(s):** Linux
- **Inference Engine:** [SGLang](https://docs.sglang.ai/)/[vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.10)
  - **Weight quantization:** OCP MXFP4, Static
  - **Activation quantization:** OCP MXFP4, Dynamic
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from [deepseek-ai/DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization.

# Model Quantization

The model was quantized from [deepseek-ai/DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Both weights and activations were quantized to MXFP4 format. 

**Preprocessing requirement:**

Before executing the quantization script below, the original FP8 model must first be dequantized to BFloat16.
You can either perform the dequantization manually using this [conversion script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py), or use the pre-converted BFloat16 model available at [amd/DeepSeek-R1-0528-BF16](https://huggingface.co/amd/DeepSeek-R1-0528-BF16).
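
To make the preprocessing step concrete, the sketch below shows the arithmetic behind block-wise dequantization in plain Python. It assumes, as an illustration rather than a description of the linked script, that each FP8 weight tile shares one inverse scale and that tiles are 128 x 128; the function name `dequantize_blockwise` and its arguments are hypothetical. The real conversion runs on tensors and writes BFloat16 checkpoints.

```python
def dequantize_blockwise(weight, scale_inv, block_size=128):
    """Multiply each weight by the inverse scale of the tile it belongs to.

    weight     -- 2D list of already-decoded FP8 values (as Python floats)
    scale_inv  -- 2D list with one inverse scale per (block_size x block_size) tile
    block_size -- tile edge length; 128 is an assumption made for this sketch
    """
    rows, cols = len(weight), len(weight[0])
    return [
        [
            weight[r][c] * scale_inv[r // block_size][c // block_size]
            for c in range(cols)
        ]
        for r in range(rows)
    ]
```

With a small tile size the mapping is easy to check by hand: for a 4 x 4 weight matrix and `block_size=2`, the element at row 3, column 3 uses `scale_inv[1][1]`.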

**Quantization script:**
```
cd Quark/examples/torch/language_modeling/llm_ptq/
exclude_layers="*lm_head model.layers.61.*"
python3 quantize_quark.py --model_dir $MODEL_DIR \
                          --quant_scheme w_mxfp4_a_mxfp4 \
                          --group_size 32 \
                          --num_calib_data 128 \
                          --exclude_layers $exclude_layers \
                          --skip_evaluation \
                          --multi_gpu \
                          --model_export hf_format \
                          --output_dir amd/DeepSeek-R1-0528-MXFP4-V2
```
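
For readers unfamiliar with the number format, the following is a minimal sketch of MXFP4 rounding for a single group of 32 values, matching the `--group_size 32` setting above. It follows the general shape of the OCP Microscaling format (FP4 E2M1 elements sharing one power-of-two E8M0 scale) and is not AMD-Quark's implementation. The helper names are hypothetical, and tie-breaking is simplified: an exact tie rounds toward the smaller magnitude here rather than to nearest-even.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is a separate bit).
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
ELEM_EMAX = 2      # exponent of the largest E2M1 value: 6.0 = 1.5 * 2**2
BLOCK_SIZE = 32    # values per shared scale
SCALE_BITS = 8     # the shared scale is an 8-bit power-of-two exponent (E8M0)
ELEM_BITS = 4


def quantize_block(block):
    """Return (shared_exponent, dequantized_values) for one block of floats."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        # All-zero block: any scale works; 0 is an arbitrary choice for this sketch.
        return 0, [0.0] * len(block)
    # Pick the power-of-two scale that maps the largest magnitude into E2M1 range.
    shared_exp = math.floor(math.log2(amax)) - ELEM_EMAX
    shared_exp = max(-127, min(127, shared_exp))  # clamp to the E8M0 exponent range
    scale = 2.0 ** shared_exp
    out = []
    for v in block:
        mag = min(abs(v) / scale, FP4_E2M1_VALUES[-1])  # saturate at 6.0
        nearest = min(FP4_E2M1_VALUES, key=lambda q: abs(q - mag))
        out.append(math.copysign(nearest * scale, v))
    return shared_exp, out


def bits_per_value():
    """Average storage cost per value, including the shared scale."""
    return (ELEM_BITS * BLOCK_SIZE + SCALE_BITS) / BLOCK_SIZE
```

The shared scale adds 8 bits per 32 values, so the average storage cost is 4.25 bits per value. In the model described here, weight scales are computed once (static) while activation scales are computed at inference time (dynamic); the rounding step sketched above is the same in both cases.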

# Deployment

This model can be deployed efficiently using the [SGLang](https://docs.sglang.ai/) and [vLLM](https://docs.vllm.ai/en/latest/) backends.

## Evaluation

The model was evaluated on the AIME24 and GSM8K benchmarks using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) framework.

### Accuracy

<table>
  <tr>
   <td><strong>Benchmark</strong>
   </td>
   <td><strong>DeepSeek-R1-0528-MXFP4-V2 (non-MTP)</strong>
   </td>
   <td><strong>DeepSeek-R1-0528-MXFP4-V2 (MTP=3)</strong>
   </td>
  </tr>
  <tr>
   <td>AIME24 
   </td>
   <td>80.00
   </td>
   <td>83.33
   </td>
  </tr>
  <tr>
   <td>GSM8K 
   </td>
   <td>95.00
   </td>
   <td>95.30
   </td>
  </tr>
</table>

### Reproduction

The AIME24 and GSM8K results were obtained using a forked [lm-evaluation-harness](https://github.com/BowenBao/lm-evaluation-harness/tree/cot) (the `cot` branch).

### Launch Server
```
#!/bin/bash
MODEL=/models/amd/DeepSeek-R1-0528-MXFP4-V2
LOG="sglang-serving.log"

SGLANG_AITER_MLA_PERSIST=1 \
python3 -m sglang.launch_server \
--model-path $MODEL \
--tensor-parallel-size 8 \
--trust-remote-code \
--chunked-prefill-size 131072 \
--host 0.0.0.0 \
--port 8321 \
--disable-radix-cache \
--mem-fraction-static 0.8 \
--max-running-requests 64 \
--attention-backend aiter 2>&1 | tee $LOG
```
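
Before starting a full benchmark run, it can help to send one request to the server by hand. The sketch below is a minimal client using only the Python standard library. It assumes the server from the script above is reachable at the same `/v1/completions` URL that the `lm_eval` commands use and that it returns an OpenAI-style response body; the helper names `build_request` and `complete` are hypothetical.

```python
import json
import urllib.request

# Host, port, and model path are taken from the launch script above.
BASE_URL = "http://0.0.0.0:8321/v1/completions"
MODEL = "/models/amd/DeepSeek-R1-0528-MXFP4-V2"


def build_request(prompt, max_tokens=64, temperature=0.6, top_p=0.95):
    """Build a POST request carrying an OpenAI-style completions payload."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the first completion's text.

    Assumes the response has the usual `choices[0].text` field.
    """
    with urllib.request.urlopen(build_request(prompt, **kwargs), timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]

# Example call once the server is up:
#   print(complete("What is 7 * 6? Answer:", max_tokens=16))
```

The sampling defaults mirror the `temperature=0.6` and `top_p=0.95` values used for the AIME24 run; adjust them to match whichever benchmark you are checking.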

### AIME24
```
lm_eval --model local-completions \
    --model_args model=/models/amd/DeepSeek-R1-0528-MXFP4-V2,base_url=http://0.0.0.0:8321/v1/completions,num_concurrent=999999,timeout=999999,tokenized_requests=False,max_length=32000,temperature=0.6,top_p=0.95 \
    --tasks aime24 \
    --num_fewshot 0 \
    --gen_kwargs "do_sample=True,temperature=0.6,top_p=0.95,max_tokens=32000" \
    --batch_size auto 2>&1 | tee aime24.log
```

### GSM8K
```
lm_eval --model local-completions \
    --model_args model=/models/amd/DeepSeek-R1-0528-MXFP4-V2,base_url=http://0.0.0.0:8321/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048,tokenized_requests=False \
    --tasks gsm8k \
    --num_fewshot 5 \
    --batch_size auto 2>&1 | tee gsm8k.log
```

# License
Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.