|
|
--- |
|
|
license: mit |
|
|
base_model: |
|
|
- deepseek-ai/DeepSeek-V3.2 |
|
|
--- |
|
|
|
|
|
**Note that the MTP (Multi-Token Prediction) layers of this model are also PTPC-quantized.**
|
|
|
|
|
# Model Overview |
|
|
|
|
|
- **Model Architecture:** DeepSeek-V3.2 |
|
|
- **Input:** Text |
|
|
- **Output:** Text |
|
|
- **Supported Hardware Microarchitecture:** AMD Instinct MI350/MI355
|
|
- **ROCm:** 7.0
|
|
- **Operating System(s):** Linux |
|
|
- **Inference Engine:** [SGLang](https://docs.sglang.ai/)/[vLLM](https://docs.vllm.ai/en/latest/) |
|
|
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.10) |
|
|
- **Weight quantization:** Per-channel, FP8E4M3, Static
|
|
- **Activation quantization:** Per-token, FP8E4M3, Dynamic
|
|
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup) |
|
|
|
|
|
This model was built from the [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for FP8E4M3 PTPC (per-token activation, per-channel weight) quantization.
|
|
|
|
|
# Model Quantization |
|
|
|
|
|
The model was quantized from [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Weights are statically quantized per output channel to FP8E4M3, and activations are dynamically quantized per token to FP8E4M3 at inference time.
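To make the PTPC scheme concrete, the following is a minimal illustrative sketch of how per-channel weight scales and per-token activation scales can be derived for FP8E4M3 (whose maximum representable magnitude is 448). This is a toy example of the scaling math only, not the AMD-Quark implementation; all values are made up.

```python
FP8_MAX = 448.0  # largest magnitude representable in FP8 E4M3

def per_channel_weight_scales(w):
    # One STATIC scale per output channel (row) of the weight matrix,
    # computed once at quantization time.
    return [max(abs(v) for v in row) / FP8_MAX for row in w]

def per_token_activation_scales(x):
    # One DYNAMIC scale per token (row), recomputed on the fly at inference.
    return [max(abs(v) for v in row) / FP8_MAX for row in x]

w = [[0.5, -2.0], [4.48, 1.0]]   # toy weight matrix, shape (out_channels, in)
x = [[1.0, -8.96]]               # toy activations, shape (num_tokens, in)

ws = per_channel_weight_scales(w)    # ~[0.00446, 0.01]
xs = per_token_activation_scales(x)  # ~[0.02]
```

Dividing each row by its scale maps it into the FP8 range before casting; the scales are kept in higher precision and folded back in during the matmul.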
|
|
|
|
|
|
|
|
### Accuracy |
|
|
|
|
|
| Benchmark | DeepSeek-V3.2 | DeepSeek-V3.2-ptpc (this model) |
|---|---|---|
| gsm8k | 96.00 | 95.75 |
|
|
|
|
|
### Reproduction |
|
|
|
|
|
- Docker image: `rocm/vllm-private:rocm7.1_ubuntu22.04_vllm0.11.2_ptpc_fp8`
- vLLM version: `0.11.2.dev521+gad32e3e19.rocm710`
- AITER version: `0.1.6.post2.dev55+g59bd8ff2c`
- lm_eval version: `0.4.9.2`
|
|
```shell
|
|
export VLLM_USE_V1=1 |
|
|
export SAFETENSORS_FAST_GPU=1 |
|
|
export VLLM_ROCM_USE_AITER=1 |
|
|
export VLLM_ROCM_USE_AITER_MOE=1 |
|
|
model_path="/model_path/deepseek-ai/DeepSeek-V3.2-ptpc" |
|
|
vllm serve $model_path \ |
|
|
--tensor-parallel-size 8 \ |
|
|
--data-parallel-size 1 \ |
|
|
--max-num-batched-tokens 32768 \ |
|
|
--trust-remote-code \ |
|
|
--no-enable-prefix-caching \ |
|
|
--disable-log-requests \ |
|
|
--kv-cache-dtype bfloat16 \ |
|
|
    --gpu-memory-utilization 0.85 \
|
|
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \ |
|
|
--block-size 1 |
|
|
|
|
|
lm_eval \ |
|
|
--model local-completions \ |
|
|
--tasks gsm8k \ |
|
|
--model_args model=/model_path/deepseek-ai/DeepSeek-V3.2-ptpc,base_url=http://127.0.0.1:8000/v1/completions \ |
|
|
--batch_size auto \ |
|
|
--limit 400 |
|
|
|
|
|
``` |
|
|
|
|
|
# Deployment |
|
|
|
|
|
This model can be deployed efficiently using the [SGLang](https://docs.sglang.ai/) or [vLLM](https://docs.vllm.ai/en/latest/) backends.
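Once served, the model is reachable through vLLM's OpenAI-compatible API. As a sketch, the snippet below builds a `/v1/completions` request body; the model path and port mirror the Reproduction section above and are assumptions that must match your local `vllm serve` invocation.

```python
import json

# Hypothetical request body for the OpenAI-compatible /v1/completions
# endpoint served by vLLM; model path and port are assumptions.
payload = {
    "model": "/model_path/deepseek-ai/DeepSeek-V3.2-ptpc",
    "prompt": "Question: A farmer has 12 cows and buys 7 more. "
              "How many cows does he have?\nAnswer:",
    "max_tokens": 64,
    "temperature": 0.0,
}
body = json.dumps(payload)
# Send with e.g.:
#   curl http://127.0.0.1:8000/v1/completions \
#        -H "Content-Type: application/json" -d "$body"
```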
|
|
|
|
|
# License |
|
|
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.