Quantization on Ascend.
To load an already quantized model, simply load the model weights and config. Again, if the model has been quantized offline, there is no need to pass the --quantization argument when starting the engine. The quantization method is parsed automatically from the downloaded quant_model_description.json or config.json file.
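The automatic detection described above can be pictured with a small sketch. This is not the engine's actual implementation: the function name `detect_quant_source`, the precedence of quant_model_description.json over config.json, and the use of a `quantization_config` entry in config.json (the usual Hugging Face convention) are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Optional


def detect_quant_source(model_dir: str) -> Optional[str]:
    """Return the name of the file that carries quantization metadata, if any.

    Hypothetical helper: it only illustrates the lookup order and does not
    interpret the contents of the metadata.
    """
    root = Path(model_dir)

    # Assumed precedence: a quant_model_description.json file is checked first.
    if (root / "quant_model_description.json").is_file():
        return "quant_model_description.json"

    # Otherwise look for a quantization_config entry in config.json
    # (assumed key name, following the Hugging Face convention).
    config_path = root / "config.json"
    if config_path.is_file():
        with config_path.open(encoding="utf-8") as f:
            config = json.load(f)
        if isinstance(config.get("quantization_config"), dict):
            return "config.json"

    # No quantization metadata found: treat the checkpoint as unquantized.
    return None
```

Calling `detect_quant_source("/path/to/model")` on a downloaded checkpoint directory returns the file name that would drive the quantization setup in this sketch, or `None` for an unquantized model.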
- W4A4 dynamic linear
- W8A8 static linear
- W8A8 dynamic linear
- W4A8 dynamic MOE
- W8A8 dynamic MOE
- W4A16 linear
- W8A16 linear (needs testing)
- W4A16 MOE (needs testing)
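The scheme labels in the list above follow the common WxAy naming, where x is the weight bit width and y is the activation bit width; "dynamic" or "static" describes how activation scales are obtained (at runtime or from offline calibration), and the last word names the layer type. As a reading aid only, a minimal sketch that splits such a label into its parts might look like this. The names `QuantScheme` and `parse_scheme` are hypothetical and not part of any engine API.

```python
import re
from typing import NamedTuple, Optional


class QuantScheme(NamedTuple):
    weight_bits: int
    activation_bits: int
    mode: Optional[str]  # "dynamic", "static", or None when the label omits it
    layer: str           # "linear" or "MOE"


# Matches labels such as "W4A8 dynamic MOE" or "W4A16 linear".
_PATTERN = re.compile(r"^W(\d+)A(\d+)(?:\s+(dynamic|static))?\s+(linear|MOE)$")


def parse_scheme(label: str) -> QuantScheme:
    """Split a scheme label into bit widths, scaling mode, and layer type."""
    match = _PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized scheme label: {label!r}")
    weight_bits, activation_bits, mode, layer = match.groups()
    return QuantScheme(int(weight_bits), int(activation_bits), mode, layer)
```

For example, `parse_scheme("W4A8 dynamic MOE")` yields 4-bit weights, 8-bit activations, dynamic scaling, and the MOE layer type.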
Compressed-tensors (LLM Compressor) on Ascend support: