Quantization on Ascend.

To load a model that has already been quantized offline, load its weights and config as usual; there is no need to pass the --quantization argument when starting the engine. The quantization method is parsed automatically from the downloaded quant_model_description.json or config.json.
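As a rough illustration of the lookup described above, the sketch below checks which of the two files in a local model directory appears to carry quantization metadata. It is not SGLang's actual loader: the function name `detect_quant_source` is hypothetical, the priority given to quant_model_description.json is an assumption, and the `quantization_config` key is the convention commonly used by Hugging Face style checkpoints rather than something this page specifies.

```python
import json
from pathlib import Path
from typing import Optional


def detect_quant_source(model_dir: str) -> Optional[str]:
    """Return the file that appears to describe quantization, or None.

    Hypothetical helper for illustration only; not part of SGLang.
    """
    root = Path(model_dir)

    # Assumption: a ModelSlim export ships this file next to the weights,
    # so its presence is treated as the first signal.
    if (root / "quant_model_description.json").is_file():
        return "quant_model_description.json"

    # Assumption: other offline-quantized checkpoints record their method
    # under a "quantization_config" entry in config.json.
    config_path = root / "config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        if "quantization_config" in config:
            return "config.json"

    # Neither file carries quantization metadata.
    return None
```

Running such a check on a downloaded checkpoint before starting the engine is a quick way to confirm that the model really was quantized offline.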

ModelSlim support on Ascend:

  • W4A4 dynamic linear
  • W8A8 static linear
  • W8A8 dynamic linear
  • W4A8 dynamic MOE
  • W8A8 dynamic MOE
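The scheme labels in these lists follow the common W<bits>A<bits> convention: the number after W is the weight bit width and the number after A is the activation bit width. A minimal sketch that splits such a label is shown here; `QuantScheme` and `parse_scheme` are hypothetical names used only for illustration and are not part of SGLang or ModelSlim.

```python
import re
from typing import NamedTuple


class QuantScheme(NamedTuple):
    weight_bits: int
    activation_bits: int


def parse_scheme(label: str) -> QuantScheme:
    """Extract weight and activation bit widths from a label such as 'W4A8 dynamic MOE'."""
    # The label is expected to start with W<digits>A<digits>; anything after
    # that (e.g. "dynamic MOE") is ignored by this sketch.
    match = re.match(r"W(\d+)A(\d+)", label.strip())
    if match is None:
        raise ValueError(f"not a W<bits>A<bits> label: {label!r}")
    return QuantScheme(int(match.group(1)), int(match.group(2)))


scheme = parse_scheme("W4A8 dynamic MOE")
print(scheme.weight_bits, scheme.activation_bits)
```

For example, W4A16 denotes 4-bit weights with 16-bit activations, that is, weight-only quantization.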

AWQ support on Ascend:

  • W4A16 linear
  • W8A16 linear (not yet tested)
  • W4A16 MOE (not yet tested)

Compressed-tensors (LLM Compressor) support on Ascend: