Update README.md
Base model: [deepseek-ai/DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528)
This repository contains a mixed-precision (Int4 + selective Int8) GPTQ version of DeepSeek-R1-0528 for vLLM. We began with a standard 4-bit (AWQ/GPTQ) conversion following vLLM's default quantization layout, but early tests showed that a fully Int4 model could not meet the accuracy demands of this checkpoint and could produce unstable outputs.
Guided by this preliminary analysis, we introduced targeted per-layer Int8 refinement: only the layers most sensitive to quantization are stored in Int8 (the compact version has more Int8 layers), while the rest remain Int4. This keeps the file-size increase minimal compared with the pure 4-bit baseline while restoring response quality.
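The per-layer split above can be sketched as a simple override map, 8-bit for sensitive layers and 4-bit everywhere else. This is an illustration only, not the repo's actual quantization config: the layer names and the `SENSITIVE_LAYERS` set are hypothetical.

```python
# Illustrative sketch of a mixed Int4/Int8 per-layer bit assignment.
# SENSITIVE_LAYERS and the layer names are hypothetical placeholders,
# not the actual layers chosen for this checkpoint.
SENSITIVE_LAYERS = {
    "model.layers.0.self_attn.q_proj",
    "model.layers.61.mlp.down_proj",
}

def bits_for_layer(name: str) -> int:
    """Return the weight bit-width for a layer: Int8 if it is
    quantization-sensitive, Int4 otherwise."""
    return 8 if name in SENSITIVE_LAYERS else 4

def build_dynamic_config(layer_names):
    """Build a GPTQ-style per-layer override map from layer names."""
    return {name: {"bits": bits_for_layer(name)} for name in layer_names}

layers = ["model.layers.0.self_attn.q_proj", "model.layers.0.mlp.gate_proj"]
cfg = build_dynamic_config(layers)
print(cfg["model.layers.0.self_attn.q_proj"]["bits"])  # 8
print(cfg["model.layers.0.mlp.gate_proj"]["bits"])     # 4
```

A "compact" variant would simply grow the sensitive set, trading a little extra file size for quality.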
Currently, vllm==0.9.0 does not support per-layer quantization settings for the MoE module. I've provided a basic implementation by adding a get_moe_quant_method function to the gptq_marlin.py file. Until the corresponding PR is merged, please replace that file in your vLLM installation with the one attached here.
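The idea behind the patch can be sketched as a per-layer dispatch: look up the layer's override entry and return an Int8 or Int4 quantization method accordingly. This is an illustration of the concept only; apart from the name get_moe_quant_method, the classes and layer names here are hypothetical and not vLLM's real internals.

```python
# Conceptual sketch of per-layer quant-method dispatch for MoE layers.
# QuantMethod and the example layer names are hypothetical stand-ins,
# not the actual vLLM classes used in gptq_marlin.py.
from dataclasses import dataclass

@dataclass
class QuantMethod:
    bits: int  # weight bit-width this method applies

def get_moe_quant_method(layer_name: str, overrides: dict) -> QuantMethod:
    """Pick the quantization method for a MoE layer: use the per-layer
    override bit-width if present, else fall back to the 4-bit default."""
    bits = overrides.get(layer_name, {}).get("bits", 4)
    return QuantMethod(bits=bits)

overrides = {"model.layers.3.mlp.experts": {"bits": 8}}
print(get_moe_quant_method("model.layers.3.mlp.experts", overrides).bits)  # 8
print(get_moe_quant_method("model.layers.4.mlp.experts", overrides).bits)  # 4
```

The attached gptq_marlin.py applies this kind of lookup inside vLLM's GPTQ-Marlin path so MoE layers can pick up per-layer settings.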