Model Overview

  • Model Architecture: DeepSeek-R1-0528
    • Input: Text
    • Output: Text
  • Supported Hardware Microarchitecture: AMD MI350/MI355
  • ROCm: 7.0
  • PyTorch: 2.8.0
  • Transformers: 5.0.0
  • Operating System(s): Linux
  • Inference Engine: SGLang/vLLM
  • Model Optimizer: AMD-Quark (V0.11)
    • Base model:
      • Weight quantization: self_attn per-channel, FP8 E4M3, static; MoE OCP MXFP4, static
      • Activation quantization: self_attn per-token, FP8 E4M3, dynamic; MoE OCP MXFP4, dynamic
    • MTP (multi-token prediction) module:
      • Weight quantization: self_attn per-channel, FP8 E4M3, static; MoE OCP MXFP4, static
      • Activation quantization: self_attn per-token, FP8 E4M3, dynamic; MoE OCP MXFP4, dynamic
  • Calibration Dataset: Pile
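The MoE scheme above, OCP MXFP4, groups values into blocks of 32 that share one power-of-two (E8M0) scale, with each element snapped to the signed FP4 E2M1 grid. A minimal quantize-dequantize sketch of that idea (a simplification, not AMD-Quark's actual kernel; rounding and scale selection may differ):

```python
import numpy as np

# Signed FP4 E2M1 magnitudes used by OCP MXFP4 elements.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quant_dequant(x, block=32):
    """Fake-quantize x to MXFP4: shared power-of-two scale per 32-value block."""
    x = x.reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Smallest power-of-two scale that maps the block max onto the grid max (6.0).
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-30) / FP4_GRID[-1]))
    scaled = x / scale
    # Snap each element to the nearest FP4 magnitude, keeping its sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(scaled) * FP4_GRID[idx] * scale

w = np.random.randn(4, 32)
wq = mxfp4_quant_dequant(w)
print(np.abs(w - wq).max())  # bounded by the FP4 grid spacing times the block scale
```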

This model was built from the deepseek-ai/DeepSeek-R1-0528 model by applying AMD-Quark for quantization.

Model Quantization

The model was quantized from deepseek-ai/DeepSeek-R1-0528 using AMD-Quark. Both weights and activations were quantized.

Preprocessing requirement:

Before executing the quantization script below, the original FP8 model must first be dequantized to BFloat16. You can either perform the dequantization manually using this conversion script, or use the pre-converted BFloat16 model available at amd/DeepSeek-R1-0528-BF16.
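To illustrate what the dequantization step does, here is a minimal NumPy sketch, assuming the block-wise layout used by the DeepSeek-R1 FP8 checkpoint (FP8 E4M3 weights plus one inverse scale per 128×128 tile, stored as `weight_scale_inv`); the linked conversion script is the authoritative implementation, and FP8 values are shown here already read back as floats:

```python
import numpy as np

def dequant_blockwise(w_fp8, scale_inv, block=128):
    """Expand each per-block inverse scale over its tile and multiply it in."""
    # scale_inv has shape (ceil(M/block), ceil(N/block)) for a (M, N) weight.
    scale = np.repeat(np.repeat(scale_inv, block, axis=0), block, axis=1)
    # Trim the expanded scale map to handle ragged edge tiles.
    return (w_fp8 * scale[: w_fp8.shape[0], : w_fp8.shape[1]]).astype(np.float32)
```

The resulting float32 array would then be cast to BFloat16 before quantization.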

Quantization scripts:

cd Quark/examples/torch/language_modeling/llm_ptq/
export exclude_layers="*mlp.gate.* *lm_head model.layers.61.eh_proj model.layers.61.shared_head.head model.layers.61.embed_tokens"
python3 quantize_quark.py --model_dir amd/DeepSeek-R1-0528-BF16 \
                          --quant_scheme mxfp4 \
                          --layer_quant_scheme '*self_attn*' ptpc_fp8 \
                          --exclude_layers $exclude_layers \
                          --skip_evaluation \
                          --model_export hf_format \
                          --output_dir amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 \
                          --multi_gpu
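The glob patterns in `$exclude_layers` keep the MoE router gates, `lm_head`, and the MTP-specific modules at layer 61 unquantized. A hedged illustration of how such patterns select layer names (AMD-Quark's internal matching logic may differ):

```python
from fnmatch import fnmatch

# Patterns from $exclude_layers above.
EXCLUDE = [
    "*mlp.gate.*",
    "*lm_head",
    "model.layers.61.eh_proj",
    "model.layers.61.shared_head.head",
    "model.layers.61.embed_tokens",
]

def is_excluded(layer_name: str) -> bool:
    """True if the layer matches any exclusion glob and stays unquantized."""
    return any(fnmatch(layer_name, pattern) for pattern in EXCLUDE)

print(is_excluded("model.layers.10.mlp.gate.weight"))        # True: router gate
print(is_excluded("model.layers.10.mlp.experts.0.up_proj"))  # False: quantized
```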

Accuracy

| Benchmark | DeepSeek-R1-0528 | DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 (this model) |
|-----------|------------------|------------------------------------------------|
| GSM8K     | 94.24            | 94.90                                          |

Reproduction

Docker image: rocm/vllm-dev:base_main_20260212

Step 1: start a vLLM server with the quantized DeepSeek-R1 checkpoint

vllm serve amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 \
  --tensor-parallel-size 8 \
  --dtype auto \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
  --gpu-memory-utilization 0.9 \
  --block-size 1 \
  --trust-remote-code \
  --port 8000

Note: CLI parameters such as --tensor-parallel-size, --gpu-memory-utilization, and --port can be adjusted as needed to match the target runtime environment.
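Once the server is up, it exposes vLLM's OpenAI-compatible API. A minimal stdlib client sketch, assuming the default port and model name from the command above:

```python
import json
import urllib.request

def build_request(prompt,
                  model="amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4",
                  url="http://localhost:8000/v1/chat/completions"):
    """Build a chat-completions request for the locally served model."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Requires the server from Step 1 to be running:
# with urllib.request.urlopen(build_request("What is 12 * 7?")) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```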

Step 2: in a second terminal, run the GSM8K evaluation client against the running server.

python3 tests/evals/gsm8k/gsm8k_eval.py
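GSM8K scoring typically compares the number after the reference's `####` marker with the last number the model produces; a hedged sketch of that extraction rule (the actual `gsm8k_eval.py` script may extract answers differently):

```python
import re

def final_answer(text):
    """Return the last number in the text as a float, or None if there is none."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(nums[-1]) if nums else None

reference = "16 - 3 - 4 = 9 eggs left, 9 * 2 = 18. #### 18"
prediction = "She has 9 eggs to sell, earning 9 * 2 = $18. The answer is 18."
print(final_answer(prediction) == final_answer(reference))  # True
```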

License

Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.

Model size: 350B params (Safetensors) · Tensor types: BF16, F8_E4M3, U8