---
license: mit
base_model:
- deepseek-ai/DeepSeek-R1-0528
---
# Model Overview
- **Model Architecture:** DeepSeek-R1-0528
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm**: 7.0
- **PyTorch**: 2.8.0
- **Transformers**: 5.0.0
- **Operating System(s):** Linux
- **Inference Engine:** [SGLang](https://docs.sglang.ai/)/[vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.11)
- **Base model:**
  - **Weight quantization:** self_attn per-channel, FP8E4M3, static; MoE OCP MXFP4, static
  - **Activation quantization:** self_attn per-token, FP8E4M3, dynamic; MoE OCP MXFP4, dynamic
- **MTP module:**
  - **Weight quantization:** self_attn per-channel, FP8E4M3, static; MoE OCP MXFP4, static
  - **Activation quantization:** self_attn per-token, FP8E4M3, dynamic; MoE OCP MXFP4, dynamic
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from the deepseek-ai/DeepSeek-R1-0528 model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for quantization.
# Model Quantization
The model was quantized from [deepseek-ai/DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Both weights and activations were quantized.
**Preprocessing requirement:**
Before executing the quantization script below, the original FP8 model must first be dequantized to BFloat16.
You can either perform the dequantization manually using this [conversion script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py), or use the pre-converted BFloat16 model available at [amd/DeepSeek-R1-0528-BF16](https://huggingface.co/amd/DeepSeek-R1-0528-BF16).
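
For reference, a manual dequantization run might look like the sketch below. The paths are placeholders, and the argument names are those used by the linked conversion script; verify them against the version of the script you download.

```bash
# Sketch: dequantize the original FP8 checkpoint to BFloat16 before quantization.
# Paths are placeholders; adjust them to your local checkpoint locations.
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
python3 DeepSeek-V3/inference/fp8_cast_bf16.py \
    --input-fp8-hf-path /path/to/DeepSeek-R1-0528 \
    --output-bf16-hf-path /path/to/DeepSeek-R1-0528-BF16
```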
**Quantization script:**
```bash
cd Quark/examples/torch/language_modeling/llm_ptq/
export exclude_layers="*mlp.gate.* *lm_head model.layers.61.eh_proj model.layers.61.shared_head.head model.layers.61.embed_tokens"
python3 quantize_quark.py --model_dir amd/DeepSeek-R1-0528-BF16 \
--quant_scheme mxfp4 \
--layer_quant_scheme '*self_attn*' ptpc_fp8 \
--exclude_layers $exclude_layers \
--skip_evaluation \
--model_export hf_format \
--output_dir amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 \
--multi_gpu
```
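
After export, one quick sanity check is to inspect the quantization settings written into the output directory's `config.json`. This is a minimal sketch and assumes the HF-format export writes a `quantization_config` block, as Quark's HF-format export typically does.

```bash
# Minimal sanity check: print the quantization settings from the exported checkpoint.
# The path matches the --output_dir used in the quantization script above.
python3 -c "
import json
cfg = json.load(open('amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4/config.json'))
print(json.dumps(cfg.get('quantization_config', {}), indent=2))
"
```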
### Accuracy
| Benchmark | DeepSeek-R1-0528 | DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 (this model) |
|---|---|---|
| GSM8K | 94.24 | 94.90 |
### Reproduction
Docker image: `rocm/vllm-dev:base_main_20260212`

Step 1: start a vLLM server with the quantized DeepSeek-R1 checkpoint.
```bash
vllm serve amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 \
--tensor-parallel-size 8 \
--dtype auto \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
--gpu-memory-utilization 0.9 \
--block-size 1 \
--trust-remote-code \
--port 8000
```
Note: CLI parameters such as `--tensor-parallel-size`, `--gpu-memory-utilization`, and `--port` can be adjusted as needed to match the target runtime environment.
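
Before running the evaluation, a quick way to confirm the server is up and serving the model is to send a request to vLLM's OpenAI-compatible endpoint. This is a minimal sketch; the model name and port must match the `vllm serve` command above.

```bash
# Sanity check against the OpenAI-compatible API exposed by the vLLM server.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4",
        "messages": [{"role": "user", "content": "What is 17 + 25?"}],
        "max_tokens": 64
      }'
```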
Step 2: in a second terminal, run the GSM8K evaluation client against the running server.
```bash
python3 tests/evals/gsm8k/gsm8k_eval.py
```
# License
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.