---
license: mit
base_model:
- deepseek-ai/DeepSeek-R1
---
# Model Overview
- **Model Architecture:** DeepSeek-R1
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.0
- **PyTorch:** 2.8.0
- **Transformers:** 4.53.0
- **Operating System(s):** Linux
- **Inference Engine:** [SGLang](https://docs.sglang.ai/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.10)
- **Weight quantization:** OCP MXFP4, Static
- **Activation quantization:** OCP MXFP4, Dynamic
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from the deepseek-ai/DeepSeek-R1 model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization.
# Model Quantization
The model was quantized from [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Both weights and activations were quantized to MXFP4 format, and the AutoSmoothQuant algorithm was applied to enhance accuracy.

**Preprocessing requirement:**
Before executing the quantization script below, the original FP8 model must first be dequantized to BFloat16.
You can either perform the dequantization manually using this [conversion script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py), or use the pre-converted BFloat16 model available at [unsloth/DeepSeek-R1-BF16](https://huggingface.co/unsloth/DeepSeek-R1-BF16).
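As a rough sketch, the manual conversion can be run as follows (argument names follow the DeepSeek-V3 inference README; the local paths are placeholders for your own checkpoint locations):
```
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference

# Cast the original FP8 safetensors to BFloat16 (requires the FP8 checkpoint on local disk)
python3 fp8_cast_bf16.py \
--input-fp8-hf-path /path/to/DeepSeek-R1 \
--output-bf16-hf-path /path/to/DeepSeek-R1-BF16
```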
**Quantization scripts:**
```
cd Quark/examples/torch/language_modeling/llm_ptq/

# Keep attention layers, the MoE router gate, and the output head out of quantization
exclude_layers="*self_attn* *mlp.gate.* *lm_head"

# MODEL_DIR points to the BFloat16-converted DeepSeek-R1 checkpoint (see above)
python3 quantize_quark.py --model_dir $MODEL_DIR \
--quant_scheme w_mxfp4_a_mxfp4 \
--group_size 32 \
--num_calib_data 128 \
--exclude_layers $exclude_layers \
--skip_evaluation \
--multi_gpu \
--quant_algo autosmoothquant \
--model_export hf_format \
--output_dir amd/DeepSeek-R1-MXFP4-ASQ
```
# Deployment
## Use with SGLang
This model can be deployed efficiently using the [SGLang](https://docs.sglang.ai/) backend.
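A minimal serving sketch, reusing the settings from the reproduction commands below (the tensor-parallel size of 8 and SGLang's default port 30000 are assumptions about your setup; the prompt in the request is illustrative):
```
# Launch an OpenAI-compatible SGLang server (listens on port 30000 by default)
python3 -m sglang.launch_server \
--model amd/DeepSeek-R1-MXFP4-ASQ \
--tp 8 \
--trust-remote-code

# Query the served model through the OpenAI-compatible completions endpoint
curl http://localhost:30000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "amd/DeepSeek-R1-MXFP4-ASQ", "prompt": "Briefly explain MXFP4 quantization.", "max_tokens": 256, "temperature": 0.6, "top_p": 0.95}'
```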
# Evaluation
The model was evaluated on reasoning tasks including AIME24, MMLU_COT, and GSM8K via a forked [lm-evaluation-harness](https://github.com/BowenBao/lm-evaluation-harness/tree/cot).
## Accuracy
| Benchmark | DeepSeek-R1 | DeepSeek-R1-MXFP4-ASQ (this model) | Recovery |
|-----------|-------------|------------------------------------|----------|
| AIME24    | 78.0        | 76.0                               | 97.44%   |
| MMLU_COT  | 79.90       | 79.65                              | 99.69%   |
| GSM8K     | 95.81       | 95.42                              | 99.59%   |
## Reproduction
The AIME24 and MMLU_COT results were obtained using [SGLang](https://docs.sglang.ai/), while the GSM8K result was obtained using [vLLM](https://docs.vllm.ai/en/latest/). All evaluations were conducted via the forked [lm-evaluation-harness](https://github.com/BowenBao/lm-evaluation-harness/tree/cot).
### AIME24
```
# Launching server
python3 -m sglang.launch_server \
--model amd/DeepSeek-R1-MXFP4-ASQ \
--tp 8 \
--trust-remote-code \
--n-share-experts-fusion 8 \
--disable-radix-cache
# Evaluating
lm_eval --model local-completions \
--model_args model=amd/DeepSeek-R1-MXFP4-ASQ,base_url=http://localhost:30000/v1/completions,num_concurrent=999999,timeout=999999,tokenized_requests=False,max_length=32000,temperature=0.6,top_p=0.95 \
--tasks aime24 \
--num_fewshot 0 \
--gen_kwargs "do_sample=True,temperature=0.6,top_p=0.95,max_tokens=32000" \
--batch_size auto \
--log_samples \
--output_path output_data/aime24 2>&1 | tee logs/aime24.log
```
### MMLU_COT
```
# Launching server
python3 -m sglang.launch_server \
--model amd/DeepSeek-R1-MXFP4-ASQ \
--tp 8 \
--trust-remote-code \
--chunked-prefill-size 32768 \
--mem-fraction-static 0.83
# Evaluating
lm_eval --model local-completions \
--model_args model=amd/DeepSeek-R1-MXFP4-ASQ,base_url=http://localhost:30000/v1/completions,num_concurrent=999999,timeout=999999,tokenized_requests=False,max_length=32000,temperature=0.6,top_p=0.95 \
--tasks mmlu_cot \
--num_fewshot 0 \
--gen_kwargs "do_sample=True,temperature=0.6,top_p=0.95,max_tokens=32000" \
--batch_size auto \
--log_samples \
--output_path output_data/mmlu_cot 2>&1 | tee logs/mmlu_cot.log
```
### GSM8K
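As noted above, the GSM8K result was obtained with vLLM rather than SGLang. A minimal sketch of a compatible server launch (the tensor-parallel size and port are assumptions chosen to match the `base_url` in the command below):
```
# Serve the model with vLLM on the port expected by the lm_eval command below
vllm serve amd/DeepSeek-R1-MXFP4-ASQ \
--tensor-parallel-size 8 \
--trust-remote-code \
--port 30000
```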
```
lm_eval --model local-completions \
--model_args model=amd/DeepSeek-R1-MXFP4-ASQ,base_url=http://localhost:30000/v1/completions,num_concurrent=999999,timeout=999999,tokenized_requests=False,max_length=8096 \
--tasks gsm8k \
--num_fewshot 5 \
--batch_size auto \
--log_samples \
--output_path output_data/gsm8k 2>&1 | tee logs/gsm8k.log
```
# License
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.