Model Overview

  • Model Architecture: DeepSeek-R1-0528
    • Input: Text
    • Output: Text
  • Supported Hardware Microarchitecture: AMD MI350/MI355
  • ROCm: 7.0
  • PyTorch: 2.8.0
  • Transformers: 5.0.0
  • Operating System(s): Linux
  • Inference Engine: SGLang/vLLM
  • Model Optimizer: AMD-Quark (V0.11)
    • Base model:
      • Weight quantization: self_attn per-channel, FP8 E4M3, static; MoE OCP MXFP4, static
      • Activation quantization: self_attn per-token, FP8 E4M3, dynamic; MoE OCP MXFP4, dynamic
    • MTP (multi-token prediction) module:
      • Weight quantization: self_attn per-channel, FP8 E4M3, static; MoE OCP MXFP4, static
      • Activation quantization: self_attn per-token, FP8 E4M3, dynamic; MoE OCP MXFP4, dynamic
  • Calibration Dataset: Pile
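The MoE scheme above, OCP MXFP4, groups values into blocks of 32 that share one power-of-two (E8M0) scale, with each element snapped to the signed FP4 E2M1 grid. A minimal quantize-dequantize sketch of that idea (a simplification, not AMD-Quark's actual kernel; rounding and scale selection may differ):

```python
import numpy as np

# Signed FP4 E2M1 magnitudes used by OCP MXFP4 elements.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quant_dequant(x, block=32):
    """Fake-quantize x to MXFP4: shared power-of-two scale per 32-value block."""
    x = x.reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Smallest power-of-two scale that maps the block max onto the grid max (6.0).
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-30) / FP4_GRID[-1]))
    scaled = x / scale
    # Snap each element to the nearest FP4 magnitude, keeping its sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(scaled) * FP4_GRID[idx] * scale

w = np.random.randn(4, 32)
wq = mxfp4_quant_dequant(w)
print(np.abs(w - wq).max())  # bounded by the FP4 grid spacing times the block scale
```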

This model was built from the deepseek-ai/DeepSeek-R1-0528 model by applying AMD-Quark for quantization.

Model Quantization

The model was quantized from deepseek-ai/DeepSeek-R1-0528 using AMD-Quark. Both weights and activations were quantized.

Preprocessing requirement:

Before executing the quantization script below, the original FP8 model must first be dequantized to BFloat16. You can either perform the dequantization manually using this conversion script, or use the pre-converted BFloat16 model available at amd/DeepSeek-R1-0528-BF16.
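To illustrate what the dequantization step does, here is a minimal NumPy sketch, assuming the block-wise layout used by the DeepSeek-R1 FP8 checkpoint (FP8 E4M3 weights plus one inverse scale per 128×128 tile, stored as `weight_scale_inv`); the linked conversion script is the authoritative implementation, and FP8 values are shown here already read back as floats:

```python
import numpy as np

def dequant_blockwise(w_fp8, scale_inv, block=128):
    """Expand each per-block inverse scale over its tile and multiply it in."""
    # scale_inv has shape (ceil(M/block), ceil(N/block)) for a (M, N) weight.
    scale = np.repeat(np.repeat(scale_inv, block, axis=0), block, axis=1)
    # Trim the expanded scale map to handle ragged edge tiles.
    return (w_fp8 * scale[: w_fp8.shape[0], : w_fp8.shape[1]]).astype(np.float32)
```

The resulting float32 array would then be cast to BFloat16 before quantization.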

Quantization scripts:

cd Quark/examples/torch/language_modeling/llm_ptq/
export exclude_layers="*mlp.gate.* *lm_head model.layers.61.eh_proj model.layers.61.shared_head.head model.layers.61.embed_tokens"
python3 quantize_quark.py --model_dir amd/DeepSeek-R1-0528-BF16 \
                          --quant_scheme mxfp4 \
                          --layer_quant_scheme '*self_attn*' ptpc_fp8 \
                          --exclude_layers $exclude_layers \
                          --skip_evaluation \
                          --model_export hf_format \
                          --output_dir amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 \
                          --multi_gpu
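The glob patterns in `$exclude_layers` keep the MoE router gates, `lm_head`, and the MTP-specific modules at layer 61 unquantized. A hedged illustration of how such patterns select layer names (AMD-Quark's internal matching logic may differ):

```python
from fnmatch import fnmatch

# Patterns from $exclude_layers above.
EXCLUDE = [
    "*mlp.gate.*",
    "*lm_head",
    "model.layers.61.eh_proj",
    "model.layers.61.shared_head.head",
    "model.layers.61.embed_tokens",
]

def is_excluded(layer_name: str) -> bool:
    """True if the layer matches any exclusion glob and stays unquantized."""
    return any(fnmatch(layer_name, pattern) for pattern in EXCLUDE)

print(is_excluded("model.layers.10.mlp.gate.weight"))        # True: router gate
print(is_excluded("model.layers.10.mlp.experts.0.up_proj"))  # False: quantized
```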

Accuracy

| Benchmark | DeepSeek-R1-0528 | DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 (this model) |
|-----------|------------------|------------------------------------------------|
| GSM8K     | 94.24            | 94.90                                          |

Reproduction

Docker image: rocm/vllm-dev:base_main_20260212

Step 1: start a vLLM server with the quantized DeepSeek-R1 checkpoint

vllm serve amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 \
  --tensor-parallel-size 8 \
  --dtype auto \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
  --gpu-memory-utilization 0.9 \
  --block-size 1 \
  --trust-remote-code \
  --port 8000

Note: CLI parameters such as --tensor-parallel-size, --gpu-memory-utilization, and --port can be adjusted as needed to match the target runtime environment.
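Once the server is up, it exposes vLLM's OpenAI-compatible API. A minimal stdlib client sketch, assuming the default port and model name from the command above:

```python
import json
import urllib.request

def build_request(prompt,
                  model="amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4",
                  url="http://localhost:8000/v1/chat/completions"):
    """Build a chat-completions request for the locally served model."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Requires the server from Step 1 to be running:
# with urllib.request.urlopen(build_request("What is 12 * 7?")) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```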

Step 2: in a second terminal, run the GSM8K evaluation client against the running server.

python3 tests/evals/gsm8k/gsm8k_eval.py
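GSM8K scoring typically compares the number after the reference's `####` marker with the last number the model produces; a hedged sketch of that extraction rule (the actual `gsm8k_eval.py` script may extract answers differently):

```python
import re

def final_answer(text):
    """Return the last number in the text as a float, or None if there is none."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(nums[-1]) if nums else None

reference = "16 - 3 - 4 = 9 eggs left, 9 * 2 = 18. #### 18"
prediction = "She has 9 eggs to sell, earning 9 * 2 = $18. The answer is 18."
print(final_answer(prediction) == final_answer(reference))  # True
```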

License

Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.

Model size: 350B params (Safetensors) · Tensor types: BF16, F8_E4M3, U8