|
|
--- |
|
|
license: mit |
|
|
base_model: |
|
|
- deepseek-ai/DeepSeek-V3.2 |
|
|
--- |
|
|
|
|
|
**Note that the MTP (Multi-Token Prediction) layers of this model are also PTPC-quantized.**
|
|
|
|
|
# Model Overview |
|
|
|
|
|
- **Model Architecture:** DeepSeek-V3.2 |
|
|
- **Input:** Text |
|
|
- **Output:** Text |
|
|
- **Supported Hardware Microarchitecture:** AMD Instinct MI350/MI355
|
|
- **ROCm:** 7.0
|
|
- **Operating System(s):** Linux |
|
|
- **Inference Engine:** [SGLang](https://docs.sglang.ai/)/[vLLM](https://docs.vllm.ai/en/latest/) |
|
|
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.10) |
|
|
- **Weight quantization:** Per-channel, FP8E4M3, Static
|
|
- **Activation quantization:** Per-token, FP8E4M3, Dynamic
|
|
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup) |
|
|
|
|
|
This model was built from the [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for FP8E4M3 PTPC (per-token activation, per-channel weight) quantization.
|
|
|
|
|
# Model Quantization |
|
|
|
|
|
The model was quantized from [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Weights are statically quantized per output channel to FP8E4M3, and activations are dynamically quantized per token to FP8E4M3 at inference time.
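To make the PTPC scheme concrete, the following is a minimal illustrative sketch of how per-channel weight scales and per-token activation scales can be derived for FP8E4M3 (whose maximum representable magnitude is 448). This is a toy example of the scaling math only, not the AMD-Quark implementation; all values are made up.

```python
FP8_MAX = 448.0  # largest magnitude representable in FP8 E4M3

def per_channel_weight_scales(w):
    # One STATIC scale per output channel (row) of the weight matrix,
    # computed once at quantization time.
    return [max(abs(v) for v in row) / FP8_MAX for row in w]

def per_token_activation_scales(x):
    # One DYNAMIC scale per token (row), recomputed on the fly at inference.
    return [max(abs(v) for v in row) / FP8_MAX for row in x]

w = [[0.5, -2.0], [4.48, 1.0]]   # toy weight matrix, shape (out_channels, in)
x = [[1.0, -8.96]]               # toy activations, shape (num_tokens, in)

ws = per_channel_weight_scales(w)    # ~[0.00446, 0.01]
xs = per_token_activation_scales(x)  # ~[0.02]
```

Dividing each row by its scale maps it into the FP8 range before casting; the scales are kept in higher precision and folded back in during the matmul.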
|
|
|
|
|
|
|
|
### Accuracy |
|
|
|
|
|
| Benchmark | DeepSeek-V3.2 | DeepSeek-V3.2-ptpc (this model) |
|---|---|---|
| gsm8k | 96.00 | 95.75 |
|
|
|
|
|
### Reproduction |
|
|
|
|
|
- Docker image: `rocm/vllm-private:rocm7.1_ubuntu22.04_vllm0.11.2_ptpc_fp8`
- vLLM version: `0.11.2.dev521+gad32e3e19.rocm710`
- AITER version: `0.1.6.post2.dev55+g59bd8ff2c`
- lm_eval version: `0.4.9.2`
|
|
```shell
|
|
export VLLM_USE_V1=1 |
|
|
export SAFETENSORS_FAST_GPU=1 |
|
|
export VLLM_ROCM_USE_AITER=1 |
|
|
export VLLM_ROCM_USE_AITER_MOE=1 |
|
|
model_path="/model_path/deepseek-ai/DeepSeek-V3.2-ptpc" |
|
|
vllm serve $model_path \ |
|
|
--tensor-parallel-size 8 \ |
|
|
--data-parallel-size 1 \ |
|
|
--max-num-batched-tokens 32768 \ |
|
|
--trust-remote-code \ |
|
|
--no-enable-prefix-caching \ |
|
|
--disable-log-requests \ |
|
|
--kv-cache-dtype bfloat16 \ |
|
|
    --gpu-memory-utilization 0.85 \
|
|
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \ |
|
|
--block-size 1 |
|
|
|
|
|
lm_eval \ |
|
|
--model local-completions \ |
|
|
--tasks gsm8k \ |
|
|
--model_args model=/model_path/deepseek-ai/DeepSeek-V3.2-ptpc,base_url=http://127.0.0.1:8000/v1/completions \ |
|
|
--batch_size auto \ |
|
|
--limit 400 |
|
|
|
|
|
``` |
|
|
|
|
|
# Deployment |
|
|
|
|
|
This model can be deployed efficiently using the [SGLang](https://docs.sglang.ai/) or [vLLM](https://docs.vllm.ai/en/latest/) backends.
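Once served, the model is reachable through vLLM's OpenAI-compatible API. As a sketch, the snippet below builds a `/v1/completions` request body; the model path and port mirror the Reproduction section above and are assumptions that must match your local `vllm serve` invocation.

```python
import json

# Hypothetical request body for the OpenAI-compatible /v1/completions
# endpoint served by vLLM; model path and port are assumptions.
payload = {
    "model": "/model_path/deepseek-ai/DeepSeek-V3.2-ptpc",
    "prompt": "Question: A farmer has 12 cows and buys 7 more. "
              "How many cows does he have?\nAnswer:",
    "max_tokens": 64,
    "temperature": 0.0,
}
body = json.dumps(payload)
# Send with e.g.:
#   curl http://127.0.0.1:8000/v1/completions \
#        -H "Content-Type: application/json" -d "$body"
```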
|
|
|
|
|
# License |
|
|
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.