tclf90 committed
Commit e42e491 · verified · 1 Parent(s): fdbdfb0

Update README.md

Files changed (1): README.md (+2 −2)
README.md CHANGED
@@ -15,9 +15,9 @@ base_model_relation: quantized
 # DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Lite
 Base model: [deepseek-ai/DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528)

- This model was converted into a 4-bit format following the default quantization layout of the vLLM model files. However, this configuration doesn't fully satisfy the computational requirements of the model and may result in issues during response generation (including but not limited to AWQ and GPTQ).
+ This repository contains a mixed-precision (Int4 + selective Int8) GPTQ version of DeepSeek-R1-0528 for vLLM. We began with a standard 4-bit (AWQ/GPTQ) conversion that follows vLLM's default quantization layout, but early tests showed that a fully Int4 model could not meet the compute demands of this checkpoint and may produce unstable outputs.

- Preliminary analysis suggests that per-layer quantization appears necessary. Here, we implemented minimal Int8 mixed precision processing to keep the file size expansion as small as possible beyond the base 4-bit version.
+ Guided by this preliminary analysis, we introduced targeted, per-layer Int8 refinement: only the layers most sensitive to quantization are stored in Int8, while the rest remain Int4. This keeps the file-size increase minimal compared with the pure 4-bit baseline while restoring response quality.

 Currently, vllm==0.9.0 does not support per-layer quantization settings for the moe module. I've provided a basic implementation by adding the get_moe_quant_method function within the gptq_marlin.py file. Before the PR is merged, please replace the corresponding file with the attached one.
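The per-layer Int4/Int8 mixing described above can be illustrated with a minimal sketch. Note that everything here except the function name `get_moe_quant_method` (which the commit mentions) is an assumption: the override patterns, the settings dictionary, and the regex-based matching are hypothetical stand-ins for whatever the attached `gptq_marlin.py` actually implements.

```python
# Hedged sketch of per-layer quantization dispatch for a mixed
# Int4/Int8 GPTQ model. Pattern table and return format are assumptions;
# only the name get_moe_quant_method comes from the commit text.
import re

# Hypothetical overrides: most layers stay Int4, while a few
# quantization-sensitive expert projections are promoted to Int8.
DYNAMIC_OVERRIDES = {
    r".*\.mlp\.experts\..*down_proj.*": {"bits": 8},
}


def get_moe_quant_method(layer_name: str, default_bits: int = 4) -> dict:
    """Return the quantization settings for a single named layer.

    Falls back to the default Int4 settings unless the layer name
    matches one of the per-layer override patterns.
    """
    for pattern, override in DYNAMIC_OVERRIDES.items():
        if re.fullmatch(pattern, layer_name):
            return {"bits": override.get("bits", default_bits)}
    return {"bits": default_bits}
```

In this sketch, a quantization-sensitive expert layer resolves to Int8 while an ordinary attention projection keeps the Int4 default; the real patch presumably performs an analogous per-layer lookup inside vLLM's gptq_marlin quantization path.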