Model Overview
- Model Architecture: DeepSeek-R1-0528
- Input: Text
- Output: Text
- Supported Hardware Microarchitecture: AMD MI350/MI355
- ROCm: 7.0
- PyTorch: 2.8.0
- Transformers: 5.0.0
- Operating System(s): Linux
- Inference Engine: SGLang/vLLM
- Model Optimizer: AMD-Quark (V0.11)
- Base model:
  - Weight quantization: self_attn per-channel, FP8E4M3, static; MoE OCP MXFP4, static
  - Activation quantization: self_attn per-token, FP8E4M3, dynamic; MoE OCP MXFP4, dynamic
- MTP:
  - Weight quantization: self_attn per-channel, FP8E4M3, static; MoE OCP MXFP4, static
  - Activation quantization: self_attn per-token, FP8E4M3, dynamic; MoE OCP MXFP4, dynamic
- Calibration Dataset: Pile
This model was built from deepseek-ai/DeepSeek-R1-0528 by applying AMD-Quark for quantization.
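To make the "OCP MXFP4" entries above more concrete, here is a minimal sketch of MXFP4 fake quantization for one block of values. It assumes the OCP Microscaling convention of 32-element blocks that share one power-of-two scale, with each element stored as an FP4 (E2M1) value. The function name `quantize_block_mxfp4` is hypothetical, the tie-breaking is simplified, and this is not AMD-Quark's implementation.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2      # exponent of the largest power of two in E2M1 (4.0 == 2**2)
BLOCK_SIZE = 32    # elements sharing one scale in the MX formats


def quantize_block_mxfp4(block):
    """Return (shared_exponent, dequantized_values) for one block of floats.

    Illustrative only: ties are resolved toward the smaller magnitude here,
    and the shared exponent is not clamped to the E8M0 range.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)

    # Shared power-of-two scale chosen so the block maximum lands near the
    # top of the E2M1 range.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp

    dequantized = []
    for x in block:
        magnitude = min(abs(x) / scale, E2M1_GRID[-1])  # saturate at 6.0
        nearest = min(E2M1_GRID, key=lambda g: abs(g - magnitude))
        dequantized.append(math.copysign(nearest, x) * scale)
    return shared_exp, dequantized
```

Calling `quantize_block_mxfp4` on a block and comparing the result with the input gives a rough feel for the rounding error that MXFP4 introduces in the MoE weights and activations.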
Model Quantization
The model was quantized from deepseek-ai/DeepSeek-R1-0528 using AMD-Quark. Both weights and activations were quantized.
Preprocessing requirement:
Before executing the quantization script below, the original FP8 model must first be dequantized to BFloat16. You can either perform the dequantization manually using this conversion script, or use the pre-converted BFloat16 model available at amd/DeepSeek-R1-0528-BF16.
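As a quick sanity check before quantizing, the sketch below inspects a checkpoint's `config.json` to see whether it still declares FP8 quantization. It assumes the checkpoint follows the Hugging Face convention of a `quantization_config` entry with a `quant_method` field; the helper names `is_fp8_checkpoint` and `load_config` are hypothetical.

```python
import json
from pathlib import Path


def load_config(model_dir):
    """Read config.json from a local checkpoint directory."""
    return json.loads((Path(model_dir) / "config.json").read_text())


def is_fp8_checkpoint(config):
    """Return True if the config declares FP8 quantization.

    Assumption: FP8 checkpoints carry a `quantization_config` dict whose
    `quant_method` is "fp8"; a BFloat16 checkpoint has no such entry.
    """
    quant = config.get("quantization_config") or {}
    return quant.get("quant_method") == "fp8"
```

If `is_fp8_checkpoint(load_config(path))` returns `True` for the directory you plan to pass as `--model_dir`, the checkpoint most likely still needs to be dequantized to BFloat16 first.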
Quantization scripts:
cd Quark/examples/torch/language_modeling/llm_ptq/
export exclude_layers="*mlp.gate.* *lm_head model.layers.61.eh_proj model.layers.61.shared_head.head model.layers.61.embed_tokens"
python3 quantize_quark.py --model_dir amd/DeepSeek-R1-0528-BF16 \
--quant_scheme mxfp4 \
--layer_quant_scheme '*self_attn*' ptpc_fp8 \
--exclude_layers $exclude_layers \
--skip_evaluation \
--model_export hf_format \
--output_dir amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 \
--multi_gpu
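The command above combines a default scheme (`mxfp4`), a per-layer override (`'*self_attn*'` to `ptpc_fp8`), and an exclusion list. The following sketch shows how such patterns could resolve to a scheme per layer. It assumes glob-style matching on layer names and that exclusions take precedence over overrides; AMD-Quark's actual matching rules may differ, and the layer names used with it are illustrative.

```python
from fnmatch import fnmatchcase

# Patterns copied from the quantization command.
EXCLUDE_LAYERS = (
    "*mlp.gate.* *lm_head model.layers.61.eh_proj "
    "model.layers.61.shared_head.head model.layers.61.embed_tokens"
).split()
LAYER_QUANT_SCHEMES = [("*self_attn*", "ptpc_fp8")]
DEFAULT_SCHEME = "mxfp4"


def resolve_scheme(layer_name):
    """Return the scheme for a layer name, or None if it stays unquantized.

    Assumed precedence: exclusion list, then per-layer overrides, then the
    default scheme.
    """
    if any(fnmatchcase(layer_name, pat) for pat in EXCLUDE_LAYERS):
        return None
    for pattern, scheme in LAYER_QUANT_SCHEMES:
        if fnmatchcase(layer_name, pattern):
            return scheme
    return DEFAULT_SCHEME
```

Under these assumptions, attention projections resolve to `ptpc_fp8`, expert MLP projections to `mxfp4`, and the router gate, `lm_head`, and the listed layer-61 MTP modules are left unquantized. Note that `*mlp.gate.*` does not match `mlp.gate_proj`, because the pattern requires a dot directly after `gate`.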
Accuracy
| Benchmark | DeepSeek-R1-0528 | DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 (this model) |
| --- | --- | --- |
| GSM8K | 94.24 | 94.90 |
Reproduction
Docker image: rocm/vllm-dev:base_main_20260212
Step 1: start a vLLM server with the quantized DeepSeek-R1 checkpoint.
vllm serve amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 \
--tensor-parallel-size 8 \
--dtype auto \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
--gpu-memory-utilization 0.9 \
--block-size 1 \
--trust-remote-code \
--port 8000
Note: CLI parameters such as --tensor-parallel-size, --gpu-memory-utilization, and --port can be adjusted as needed to match the target runtime environment.
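Before running the evaluation, it can help to confirm that the server answers requests. The sketch below sends one prompt to vLLM's OpenAI-compatible `/v1/completions` endpoint using only the standard library. The host, the example prompt, and the helper names `build_completion_request` and `complete` are assumptions; adjust `base_url` if you changed `--port`.

```python
import json
import urllib.request

MODEL = "amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4"


def build_completion_request(prompt, model=MODEL,
                             base_url="http://localhost:8000", max_tokens=64):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    body = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding for a repeatable smoke test
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the prompt to the running server and return the generated text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["choices"][0]["text"]
```

With the server from Step 1 running, `print(complete("What is 12 * 7? Answer:"))` should print a short completion; a connection error usually means the server is still loading weights or is listening on a different port.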
Step 2: in a second terminal, run the GSM8K evaluation client against the running server.
python3 tests/evals/gsm8k/gsm8k_eval.py
License
Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.